SECTION 1
INTRODUCTION 


1.1  PROBLEM  SIGNIFICANCE 

The recent advances in very large scale integration technology and communication
networks have generated a great deal of interest in the study of multiple-instruction,
multiple-data stream (MIMD) multi-processor systems. These systems are
characterized by: (1) a number of autonomous processing elements interconnected by
a high-bandwidth communication network, (2) a distributed operating system, and (3)
highly concurrent computation brought about by the decomposition of an application
algorithm into several distinct, cooperating tasks [Kuhl and Reddy, 1986]. In addition,
the multiplicity of processing elements in a multi-processor system can be exploited to
improve system reliability and to provide graceful degradation in the presence of
hardware and software faults. The modularity, flexibility, and reliability of these
systems make them attractive for many real-time applications, such as large scale
defense applications, flight control systems, transportation systems, and manufacturing.
These applications typically have enormous computational and storage requirements,
real-time processing constraints, and fault-tolerance and data security requirements
[Pattipati et al., 1986].

One of the major issues in the efficient operation of a MIMD multi-processor is
the mapping of an application algorithm onto the constituent processors of the
system. In order to take advantage of concurrent processing, it is desired to achieve a
minimum execution time (completion time) of the algorithm with a minimum number
of processing elements via efficient algorithm-architecture mapping. There are four
important factors that contribute to the completion time of an algorithm on a MIMD
multi-processor: (1) task partitioning, (2) task allocation and sequencing, (3) the
interconnection topology and the capacity of each communication link, and (4) the speed of
each processor. The partitioning problem refers to the selection of the level of
granularity used to represent the tasks of an application algorithm. Task partitioning
determines the amount of computation required by each task, the precedence relations
among the tasks of the algorithm, and the amount of data transmitted between each
pair of tasks. The allocation and sequencing strategy determines the assignment of the
set of tasks to member processors and the order of execution of the tasks allocated to
each processor. The inter-task communication is constrained by the interconnection
topology, and the precedence relations among the tasks of the algorithm impose
synchronization requirements, i.e., a task cannot begin executing until all the tasks
preceding it have been completed.

In this report, we are concerned with the problem of mapping Battle
Management/Command, Control, and Communication (BM/C3) algorithms for
multi-target tracking and weapon-target assignment onto a non-homogeneous MIMD
multi-processor to minimize the completion time (or equivalently, maximize speedup). We
view the mapping problem as one of characterizing a BM/C3 algorithm and the
multi-processor system as graphs, and subsequently assigning and sequencing the nodes of
the algorithm graph to the nodes of the processor graph to minimize the completion
time of the algorithm, subject to reliability, storage, and security constraints. A flow
chart of the mapping process is shown in Fig. 1-1. A BM/C3 algorithm is partitioned
into tasks, and the communication among tasks is represented as a directed acyclic
graph. These graphs are termed task graphs, problem graphs, or computation flow
graphs. A multi-processor, on the other hand, is characterized by an undirected graph
that depicts the interconnection topology of the architecture. These graphs are termed
processor graphs, system graphs, or computation resource graphs. In the BM/C3
application, the task and processor graphs are time-varying due to: (1) the dynamic
nature of communication among tasks, (2) lack of a priori information on the data
dependencies among tasks, (3) the stochastic nature of the time between executions
of a given task, and (4) failures and/or on-line repair of the computational resources of
the multi-processor architecture. As discussed below, various assumptions on the
time-dependence of the task and processor graph parameters lead to static, quasi-static, and
dynamic mapping problems. The resulting optimization problem of allocating tasks to
processors and sequencing the tasks on member processors is solved, and
performance measures such as the processor utilization, completion time, speedup, and
communication delay are computed. If the results are not satisfactory, an alternative task
division or a new multi-processor architecture may be tried out, and the analysis
repeated.


FIGURE 1-1: OVERVIEW OF ALGORITHM-ARCHITECTURE
MAPPING PROBLEM

1.2 A TAXONOMY OF MAPPING PROBLEMS

The  mapping  problems  can  be  classified  based  on  the  following  two  elements:  (1) 
the  time-dependence  of  the  task  and  processor  graph  parameters,  and  (2)  the  level  of 
information  on  the  task  graph  parameters.  For  the  multi-processor  configuration,  the 
amount  of  task  computation  and  communication  among  tasks  can  be  stationary  (static 
or  time-invariant)  or  dynamic.  Dynamic  behavior  refers  to  a  situation  wherein  the  task 
graph  and/or  processor  graph  parameters  are  time  varying,  while  in  the  static  case  they 
remain  fixed.  This  in  turn  gives  rise  to  static  and  dynamic  mapping  methods.  The 
dynamic mapping methods are considerably more complex than the static mapping
methods, since they involve multi-stage optimization, wherein the effects of current
mapping decisions on all future changes in the task and processor graphs must be
considered. In some cases, the computation and communication vary slowly with time,
and, hence, can be assumed to be static over relatively long time intervals. We call
such task graphs quasi-static, and the corresponding mapping methods are termed
quasi-static mapping methods. In this case, we solve a series of static mapping
problems periodically by taking into account the migration cost incurred in changing the
current mapping to the next mapping. The quasi-static mapping methods are also
useful when the task graphs are relatively stationary, but the multi-processor system might
undergo configuration changes due to failures and recovery. In this case, we solve a
series of static mapping problems for each possible multi-processor configuration or a
small set of aggregated configurations. As mentioned earlier, the data dependencies in
BM/C3 algorithms fall under the dynamic category, but can be approximated by a
quasi-static model.

The task parameters can be either deterministic or probabilistic. In the
deterministic situation, the amount of task computation, and the amount and frequency of data
transfers among tasks are perfectly known. On the other hand, the probabilistic
knowledge of the task graph parameters can assume one of two forms: "complete
information" or "partial information". Under complete information, the probabilistic
knowledge of the task parameters is represented by a known probability distribution
function with known parameters of the distribution. For example, if the amount of data
transferred between two tasks is exponentially distributed with a given mean, then the
information is considered complete. Under partial information, the form of the
probability densities of the task parameters is known, but the parameters of the distribution
have to be estimated (learned) on-line. For example, we may know that the amount of
data transferred between two tasks is exponentially distributed, but the mean is
unknown.
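
The partial-information case can be illustrated with a small on-line estimator. The sketch below assumes the sample mean is used to learn the unknown mean of an exponentially distributed transfer size; the report does not prescribe an estimator, and the names `RunningMean` and `estimate_mean` are hypothetical.

```python
import random


class RunningMean:
    """Incremental sample mean of the observed data-transfer sizes."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Standard incremental update: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def estimate_mean(samples):
    """Feed observations one at a time, as they would arrive on-line."""
    estimator = RunningMean()
    for x in samples:
        estimator.update(x)
    return estimator.mean


# Illustration only: draw transfer sizes from an exponential law whose mean (4.0)
# is treated as unknown, and learn it from the observations.
rng = random.Random(0)
observations = [rng.expovariate(1.0 / 4.0) for _ in range(10000)]
print(estimate_mean(observations))
```

For the exponential distribution the sample mean is also the maximum-likelihood estimate of the mean, which is why it is a natural default here.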

The  classification  discussed  above  is  summarized  in  Fig.  1-2.  Our  primary  focus 
in  this  report  is  on  the  static,  deterministic  mapping  problem  that  explicitly  considers 
the precedence restrictions on task execution, data communication delay, and
redundancy, security, and storage constraints. Future research will address the quasi-static
and dynamic mapping problems.

FIGURE 1-2: A TAXONOMY OF MAPPING PROBLEMS

1.3 SUMMARY OF RESULTS

The static, deterministic mapping problem is NP-complete: no polynomial-time
optimal algorithm is known, and the memory and computational requirements of known
optimal algorithms grow exponentially with the number of tasks and the number of
processors [Horowitz and Sahni, 1978; Garey and Johnson, 1979; Bokhari, 1981].
Therefore, all practical algorithms involve the use of
heuristics to subdue the computational explosion of the optimal mapping algorithms.
We develop four algorithms to solve the mapping problem: (1) a greedy heuristic
algorithm, (2) a pair-wise exchange algorithm, (3) an optimal A* algorithm, and (4) an
$\epsilon$-optimal $A^*_\epsilon$ algorithm.

The greedy heuristic is a two-stage algorithm. The first stage determines the order
of task execution/allocation, and the second stage chooses, for each task in the
sequence, the processor that completes it at the earliest possible time. The order of task
execution is determined using the notion of a level of a node in the task graph. Intuitively,
the level of a node i in the task graph is the length of the critical (longest) path
between node i and the terminal node of the task graph. When the multi-processor is made
up of processors with different speeds, we assume that every task on a path from node i
to the terminal node is executed on the fastest processor. The execution order of the
tasks is based on
decreasing node levels on the premise that the tasks with longer paths (or larger level)
should be completed as soon as possible [Hu, 1961; Kohler, 1975]. The results of
computational experiments on several hundred random graphs, the weapon-target
assignment, and the multi-target tracking algorithms have shown that the greedy
heuristic provides an optimal mapping in over 75% of the test cases, but can be in error
by as much as 230% from the optimal completion time in some test cases.
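
The two stages described above can be sketched as follows. This is a simplified illustration, not the report's algorithm from section 3: it ignores communication delays, replication, memory, and security constraints, and it assumes strictly positive service demands so that decreasing level order respects precedence. The function names and the dictionary-based graph encoding are hypothetical.

```python
def node_levels(succ, demand, fastest_rate):
    """Level of node i: longest path from i to a sink, timing every task on the fastest processor."""
    levels = {}

    def level(i):
        if i not in levels:
            tail = max((level(j) for j in succ.get(i, ())), default=0.0)
            levels[i] = demand[i] / fastest_rate + tail
        return levels[i]

    for i in demand:
        level(i)
    return levels


def greedy_map(succ, demand, rates):
    """Stage 1: order tasks by decreasing level. Stage 2: earliest-finish processor."""
    assert all(d > 0 for d in demand.values()), "sketch assumes positive demands"
    pred = {i: [] for i in demand}
    for i, successors in succ.items():
        for j in successors:
            pred[j].append(i)

    levels = node_levels(succ, demand, max(rates))
    order = sorted(demand, key=lambda i: -levels[i])

    free_at = [0.0] * len(rates)   # time at which each processor becomes idle
    finish, placed = {}, {}
    for i in order:
        ready = max((finish[p] for p in pred[i]), default=0.0)
        # Pick the processor on which task i would complete first.
        best = min(range(len(rates)),
                   key=lambda q: max(ready, free_at[q]) + demand[i] / rates[q])
        finish[i] = max(ready, free_at[best]) + demand[i] / rates[best]
        free_at[best] = finish[i]
        placed[i] = best
    return order, placed, max(finish.values())


# Four-task diamond graph on two unit-speed processors (values are illustrative).
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
demand = {1: 1, 2: 3, 3: 4, 4: 2}
print(greedy_map(succ, demand, rates=[1, 1]))
```

Ties between processors are broken by lowest index; a fuller implementation would add the communication delay along the interconnection path to the `ready` time.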

In order to improve the performance of the greedy heuristic, we developed a
second mapping algorithm using the concept of pair-wise exchange. In effect, the
algorithm exchanges the execution order of all possible pairs of tasks in the sequence,
while satisfying the precedence relations of the task graph. This algorithm has been
found to be optimal in over 85% of the test cases examined, and the average error in
completion time was less than 5% of the optimal for all the test cases. This is an
efficient and useful algorithm for large scale mapping problems, for which an optimal
solution cannot be computed in a practical amount of time.

The performance of the greedy heuristic and pair-wise exchange algorithms is
evaluated using an optimal mapping algorithm, based on the heuristic search
algorithm A*, as a benchmark [Nilsson, 1980; Pearl, 1984]. The heuristic evaluation function
(HEF) required by the A* algorithm is the level of each node on the task graph. This
HEF can be shown to be admissible, i.e., it is a lower bound on the optimal cost-to-go,
which ensures that the A* algorithm provides an optimal mapping. However, for large
problems, the computational requirements of the A* algorithm are prohibitive.

In order to subdue the combinatorial explosion of the A* algorithm on large scale
mapping problems, we developed an $A^*_\epsilon$ algorithm that guarantees that the completion
time is within a factor of $(1+\epsilon)$ of the optimal completion time. The tradeoff between
computational complexity and optimality of the $A^*_\epsilon$ algorithm is controlled by the choice of $\epsilon$.
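
The guarantee can be written compactly as follows. The symbols $CT^*$ (optimal completion time) and $CT_\epsilon$ (completion time of the mapping returned by the $\epsilon$-optimal search) are notation assumed here for illustration; they are not taken from the report.

```latex
% CT^*         : optimal completion time (assumed notation)
% CT_\epsilon  : completion time returned by the \epsilon-optimal A* search (assumed notation)
CT^* \;\le\; CT_\epsilon \;\le\; (1+\epsilon)\, CT^*, \qquad \epsilon \ge 0 .
% Example: with \epsilon = 0.1 and CT^* = 12, the returned mapping satisfies
% 12 \le CT_\epsilon \le 13.2; setting \epsilon = 0 recovers the optimal A* result.
```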

1.4  MAPPER  SOFTWARE  PACKAGE 

The mapping algorithms have been integrated into an interactive computer
software package, termed MAPPER, for the analysis and performance evaluation of
alternative BM/C3 algorithms and multi-processor architectures. A functional structure
of MAPPER is shown in Fig. 1-3. It consists of four functional modules: a graphical
user interface, a data translator, an algorithm driver, and the algorithms. The user interface
provides a mouse-driven graphical environment for users to draw and edit task and
processor graphs and to enter their parameters. The results of MAPPER are displayed in the
form of a Gantt chart, along with other useful performance measures such as the speedup
and the utilization of each processor. The data translator portion of MAPPER converts
the user's view of task and processor graphs into algorithm-level inputs. The algorithm
driver invokes the appropriate algorithm, based on the user's mouse-driven commands.
Finally, the algorithm portion of MAPPER consists of the four mapping algorithms
discussed earlier: the greedy heuristic, pair-wise exchange, A*, and $A^*_\epsilon$ algorithms. The
MAPPER software is hosted on a SUN workstation.


FIGURE 1-3: A SOFTWARE ENVIRONMENT FOR
ALGORITHM-ARCHITECTURE MAPPING

1.5 ORGANIZATION OF THE REPORT

In  section  2,  we  provide  a  mathematical  formulation  of  the  static,  deterministic 
mapping  problem,  and  discuss  its  relation  to  previous  scheduling  techniques. 
Specifically,  we  show  that  simplified  formulations  lead  to  the  following  problems:  (1) 
scheduling  independent  tasks  on  parallel  identical  processors,  (2)  scheduling  tree- 
structured  task  graphs  on  parallel  identical  processors  wherein  each  task  requires  unit 
processing time and zero communication delay, and (3) all previous mapping
formulations discussed in the literature. Also included in this section are the drawbacks of the
previous approaches, and the key features that distinguish our formulation from the
earlier ones.

In section 3, we derive the key mapping equation that forms the basis of all
four mapping algorithms developed in this report. We also derive the four algorithms
and illustrate their performance on a simple example.

Section 4 provides four sets of computational experiments to demonstrate the
performance of the mapping algorithms. The first set considers hypothetical examples
gleaned from the literature. The second set of experiments deals with several
hundred random graphs over a wide range of computation/communication ratios, and
was used to critically assess the performance of the heuristic and pair-wise exchange
algorithms. The third and fourth sets of experiments are related to the weapon-target
assignment and the multi-target tracking algorithms.

Section 5 provides a summary of the research accomplishments and future
research plans. Appendix A describes the design methodology for the user interface of
the MAPPER software package. Finally, Appendix B contains a user manual for
MAPPER.

SECTION 2

STATIC  MAPPING  PROBLEM 
FORMULATION  AND  PREVIOUS  APPROACHES 

2.1  PROBLEM  FORMULATION 

We formulate the problem of mapping a static task graph onto a static processor
graph as one of minimizing the completion time of the task graph subject to
constraints on (1) the memory available at each processor, (2) the replication level of each
task, and (3) security. The task graph (also termed a problem graph or computation
flow graph (CFG)) is a directed, acyclic graph, $G_t = (V_t, E_t)$, where
$V_t = \{ i : i = 1, 2, \ldots, N \}$ is the set of vertices (nodes) denoting the tasks of the application
algorithm, and $E_t = \{ \langle i,j \rangle : i, j = 1, 2, \ldots, N;\ i \neq j \}$ is the set of directed edges (arcs,
links) representing the inter-task communication between pairs of tasks i and j, and
the partial-order constraint that task i must precede task j. Each node i of the task
graph is parameterized by the 3-tuple $(s_i, m_i, r_i)$, where $s_i$ is the service demand of
task i measured in millions of instructions, $m_i$ is the memory requirement of
task i measured in kilobytes (Kb), and $r_i$ is the replication level of task i for
fault tolerance. Each directed edge $\langle i,j \rangle$ of the task graph is parameterized by $v_{ij}$, where $v_{ij}$
denotes the amount of data transmitted between tasks i and j measured in
bits. We assume, without loss of generality, that the task graph $G_t$ consists of a start
(source) node and a terminal (sink) node, i.e., the task graph $G_t$ is such that each node
can be reached if we go forward from the start node or go backward from the terminal
node. If the task graph does not have a start node and/or a terminal node, a dummy
start node/terminal node can always be added to the task graph such that the number
of instructions of the dummy node and the data transmitted from the dummy node to
other nodes of the task graph (and vice versa) are zero. For notational convenience,
we assume that the terminal task corresponds to node N of the task graph, $G_t$.
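
The dummy-node construction can be sketched as below. The encoding (a dictionary of node 3-tuples and a dictionary of edge data volumes) and the labels "start" and "end" are hypothetical; the zero service demand and zero data volume follow the text, while the zero memory requirement and replication level of 1 for the dummy nodes are assumptions made for this illustration.

```python
def add_dummy_endpoints(nodes, edges):
    """Return copies of (nodes, edges) with a single start node and a single terminal node.

    nodes: dict task_id -> (s_i, m_i, r_i)
    edges: dict (i, j) -> v_ij, the data sent from task i to task j
    """
    nodes, edges = dict(nodes), dict(edges)
    # Sources have no incoming edge; sinks have no outgoing edge.
    sources = [i for i in nodes if not any(j == i for (_, j) in edges)]
    sinks = [i for i in nodes if not any(k == i for (k, _) in edges)]

    if len(sources) > 1:
        nodes["start"] = (0, 0, 1)        # zero instructions (memory 0, r = 1 assumed)
        for i in sources:
            edges[("start", i)] = 0       # zero data transmitted
    if len(sinks) > 1:
        nodes["end"] = (0, 0, 1)
        for i in sinks:
            edges[(i, "end")] = 0
    return nodes, edges


# Two entry tasks feeding one exit task: only a dummy start node is needed.
tasks = {1: (2, 5, 1), 2: (3, 5, 1), 3: (1, 5, 1)}
links = {(1, 3): 5, (2, 3): 1}
print(add_dummy_endpoints(tasks, links))
```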
In the same vein, the processor graph is an undirected graph, $G_p = (V_p, E_p)$,
where $V_p = \{ p : p = 1, 2, \ldots, M \}$ is the set of vertices denoting the processors, and
$E_p = \{ (p,q) : p, q = 1, 2, \ldots, M;\ p \neq q \}$ is the set of undirected edges depicting the
communication links among processors. Each node of the processor graph is
parameterized by the 2-tuple $(\mu_p, R_p)$, where $\mu_p$ is the service rate of processor p
measured in millions of instructions per second, and $R_p$ is the memory capacity of
processor p in Kb. Each edge (p,q) of the processor graph $G_p$ is parameterized by the
link capacity, $c_{pq}$, measured in bits per second. We assume that the data communication
between a pair of processors (k,q) follows the shortest path $(k, k_1, k_2, \ldots, k_n, q)$,
where $k, k_1, k_2, \ldots, q$ are the nodes on the shortest path. We assume that the
task sequencing at each processor is nonpreemptive. Furthermore, we assume that each
node of the processor graph contains an execution processor and a communication
processor so that task execution and data communication can be serviced simultaneously
at the same node. The reliability constraint is modeled as a redundant execution of
each task i at $r_i$ ($\ge 1$) distinct processors. Finally, the security constraint implies that
each task i can be executed only at a certain set of distinct processors, $S_i$, $i \in V_t$.
Clearly, $|S_i| \ge r_i$ for all $i \in V_t$, where $|S_i|$ is the cardinality of $S_i$. That is, the
number of distinct processors to which a task i can be allocated must at least equal the
replication level, $r_i$.

Note that in the case without any constraint, i.e., no memory, reliability, or
security constraints, the node parameter of the task graph can be represented by the
service demand of task i, $s_i$, while the node parameter of the processor graph can be
represented by the service rate of processor q, $\mu_q$.

Formally, a mapping is a partition of the task set $V_t = \{1, 2, \ldots, N\}$ into M ordered
sets $T_1, T_2, \ldots, T_M$:

    $V_t = \bigcup_{1 \le q \le M} T_q$   (2-1a)

and

    $T_q = \{ i_{q1}, i_{q2}, \ldots, i_{q n_q} \}$   (2-1b)

that satisfy the precedence relationships of the task graph. Note that the ordered sets
$T_1, T_2, \ldots, T_M$ are not disjoint due to redundant execution of tasks at multiple
processors. We find it convenient to define the complementary distinct element sets $P_i$
via $P_i = \{ q : i \in T_q \text{ and no element is repeated} \}$, $i \in V_t$. That is, $P_i$ is the set of distinct
processors to which task i is allocated. In addition, let $\bar{P}_i$ denote the set of immediate
parents of task i in the task graph, $G_t$. Then the static, deterministic mapping problem
can be stated succinctly as follows:

    $\min_{T_1, T_2, \ldots, T_M} CT_N(T_1, T_2, \ldots, T_M)$   (2-2)

subject to:

    $\sum_{i \in T_q} m_i \le R_q$ ;  $q \in V_p$  (memory constraint),   (2-3)

    $|P_i| = r_i$ ;  $i \in V_t$  (reliability constraint),   (2-4)

and

    $P_i \subseteq S_i$ ;  $i \in V_t$  (security constraint),   (2-5)

where $CT_N(T_1, T_2, \ldots, T_M)$ is the completion time of the terminal task N (or
makespan) under the mapping $(T_1, T_2, \ldots, T_M)$. Note that the mapping must satisfy the
precedence constraints, viz., task i cannot begin executing on processor q, $q \in P_i$, until
the data from each task of the parent set $\bar{P}_i$ is available at processor q, $i \in V_t$. We will
provide a precise mathematical characterization of the precedence constraint in section
3.1.

Once the mapping $(T_1, T_2, \ldots, T_M)$ is known, the speedup can be obtained via:

    speedup $= \left[ \sum_{i \in V_t} r_i s_i \right] / \left[ \mu_f \, CT_N(T_1, T_2, \ldots, T_M) \right]$ ,   (2-6)

where $\mu_f$ is the service rate of the fastest processor. The processor utilization, $U_q$, is
obtained from:

    $U_q = \frac{1}{\mu_q} \sum_{i \in T_q} s_i \,/\, CT_N(T_1, T_2, \ldots, T_M)$ .   (2-7)

The utilization of links can be derived in a similar manner.

Illustrative  Example: 

To illustrate the problem formulation, consider the task and processor graphs
shown in Fig. 2-1. The task graph is represented by $V_t = \{1, 2, 3, 4\}$,
$E_t = \{ \langle 1,2 \rangle, \langle 1,3 \rangle, \langle 2,4 \rangle, \langle 3,4 \rangle \}$. The node parameters are:
$(s_1, m_1, r_1) = (1, 5, 1)$, $(s_2, m_2, r_2) = (3, 10, 2)$, $(s_3, m_3, r_3) = (4, 20, 1)$, and
$(s_4, m_4, r_4) = (2, 10, 1)$. The edge parameters
of the task graph are: $v_{12} = 1$, $v_{13} = 2$, $v_{24} = 2$, and $v_{34} = 1$. Similarly, the processor graph is
represented by: $V_p = \{1, 2\}$ and $E_p = \{ (1,2) \}$. The node parameters of the processor
graph are: $(\mu_1, R_1) = (1, 100)$ and $(\mu_2, R_2) = (1, 200)$. Since there is a single link, the
edge parameter is: $c_{12} = 1$. Let the security constraint be $S_1 = \{1,2\}$, $S_2 = \{1,2\}$, $S_3 = \{2\}$, and
$S_4 = \{1,2\}$. Then $T_1 = \{1, 2\}$ and $T_2 = \{3, 2, 4\}$ is a feasible mapping. With this mapping,
we have $P_1 = \{1\}$, $P_2 = \{1,2\}$, $P_3 = \{2\}$, $P_4 = \{2\}$, and the completion time $CT_4(T_1, T_2) = 12$.
The speedup for this mapping is SP = 1.08. The utilizations of the processors are:
$U_1 = 0.33$ and $U_2 = 0.75$.
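
The performance figures quoted for this example can be checked numerically. The sketch below assumes that the speedup is the replicated service demand, the sum of $r_i s_i$ over all tasks, divided by the product of the fastest service rate and the completion time, and that the utilization of processor q is the service demand assigned to it divided by the product of its service rate and the completion time. The function names are illustrative.

```python
def speedup(s, r, fastest_rate, completion_time):
    """Replicated service demand over (fastest service rate x completion time)."""
    return sum(r[i] * s[i] for i in s) / (fastest_rate * completion_time)


def utilization(s, tasks_on_q, rate_q, completion_time):
    """Fraction of the schedule length during which processor q executes tasks."""
    return sum(s[i] for i in tasks_on_q) / (rate_q * completion_time)


# Parameters of the illustrative example.
s = {1: 1, 2: 3, 3: 4, 4: 2}        # service demands
r = {1: 1, 2: 2, 3: 1, 4: 1}        # replication levels
T = {1: [1, 2], 2: [3, 2, 4]}       # ordered task sets per processor
mu = {1: 1, 2: 1}                   # service rates
CT = 12                             # completion time of the terminal task

print(round(speedup(s, r, max(mu.values()), CT), 2))    # 1.08
print(round(utilization(s, T[1], mu[1], CT), 2))        # 0.33
print(round(utilization(s, T[2], mu[2], CT), 2))        # 0.75
```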


FIGURE  2-1:  AN  ILLUSTRATIVE  EXAMPLE  OF 
TASK  AND  PROCESSOR  GRAPHS 


To illustrate the problem complexity, consider the case where there are no
security or memory constraints, i.e., each task can be executed on any processor. Then the
total number of different allocations for task i is $\prod_{l=0}^{r_i - 1} (M - l)$. The number of
execution orders of the tasks is upper-bounded by $N!$. Therefore, the total number of
different mappings in the worst case is $N! \prod_{i=1}^{N} \left[ \prod_{l=0}^{r_i - 1} (M - l) \right]$. When $r_i = 1$ for
$1 \le i \le N$, the total number of different mappings in the worst case is $N! \, M^N$.
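
To give a feel for how quickly this count grows, the following sketch evaluates the worst-case bound for small instances. The function names are illustrative.

```python
from math import factorial, prod


def allocations_for_task(M, r_i):
    """Ordered choices of r_i distinct processors out of M: M (M-1) ... (M - r_i + 1)."""
    return prod(M - l for l in range(r_i))


def worst_case_mappings(M, replication):
    """N! execution orders times the allocation choices of every task."""
    N = len(replication)
    return factorial(N) * prod(allocations_for_task(M, r) for r in replication)


print(worst_case_mappings(2, [1, 1, 1, 1]))   # 4! * 2**4 = 384
print(worst_case_mappings(4, [1] * 10))       # 10! * 4**10, already about 3.8e12
```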

Indeed, the mapping problem posed above is NP-complete [Horowitz and Sahni,
1978; Garey and Johnson, 1979; Bokhari, 1979], which means that an optimal
algorithm for the static, deterministic mapping problem with a run-time bound that is a
polynomial function of the number of tasks and the number of processors exists if, and
only if, all NP-complete problems, including the traveling salesman, maximum
clique, and satisfiability problems, can be solved in polynomial time [Cook, 1971;
Karp, 1972]. The evidence indicates that in all likelihood no NP-complete problem
can be solved by an algorithm of polynomial time complexity. Therefore,
all practical algorithms exploit the use of heuristics to reduce the computational
burden. In the following subsection, we discuss the relationship of the mapping problem
with previous scheduling problems in the literature in order to indicate its generality,
and to provide a basis for developing the heuristic and optimal mapping algorithms of
section 3.

2.2  RELATION  TO  PREVIOUS  SCHEDULING  TECHNIQUES 

2.2.1  SCHEDULING  INDEPENDENT  TASKS 

This problem is concerned with nonpreemptive scheduling of N independent
tasks on M processors. Thus, we assume that the task graph is a vertex graph (i.e., a
graph with isolated vertices and no edges) with no constraints on memory, reliability,
and security. The execution time of task i on processor j will be denoted by $t_{ij}$,
yielding an N x M nonnegative matrix of processing times. Since there are no
reliability constraints, a schedule for M processors is a partition of the task set
$V_t = \{1, 2, 3, \ldots, N\}$ into M disjoint ordered sets $T_1, T_2, \ldots, T_M$. It is shown that, for $M \ge 2$, the
problem of obtaining a schedule to minimize the makespan (completion time) or,
equivalently, to maximize the speedup, is NP-complete [Horowitz and Sahni, 1978;
Garey and Johnson, 1979]. This puts added importance on the development of heuristic
algorithms.

An example of a heuristic algorithm for scheduling independent tasks on identical
processors (i.e., $\mu_j = \mu$, $j = 1, 2, \ldots, M$) is the list scheduling (LS) strategy. The LS
strategy schedules tasks according to a given priority list of tasks, and at each step the
first available task on the list is assigned to a processor with the currently earliest
completion time. A natural question then is: how good are the solutions generated by the
suboptimal LS algorithm? Such a question is usually answered by assessing the
worst-case accuracy of the suboptimal algorithm. It has been proved [Graham, 1966; Syslo,
Deo, and Kowalik, 1983] that the list scheduling algorithm generates a solution which
satisfies:

    $C_{\max}(LS) \le \left( 2 - \frac{1}{M} \right) C^*_{\max}$ ,   (2-8)

where $C_{\max}(LS)$ is the completion time with the LS strategy, and $C^*_{\max}$ is the optimal
completion time. Thus, the completion time of the LS strategy can be worse by at
most a factor of two than the optimal completion time.
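
A minimal sketch of the LS strategy for independent tasks on identical processors follows. The heap-based implementation and the function name are choices made for this illustration; the small instance shows a priority list for which the worst-case ratio $2 - 1/M$ is attained.

```python
import heapq


def list_schedule(times, M):
    """Assign tasks in list order to the processor that becomes free first."""
    heap = [(0.0, q) for q in range(M)]          # (time processor is free, processor id)
    assignment = [[] for _ in range(M)]
    for task, t in enumerate(times):
        free_at, q = heapq.heappop(heap)         # earliest-available processor
        assignment[q].append(task)
        heapq.heappush(heap, (free_at + t, q))
    makespan = max(free_at for free_at, _ in heap)
    return makespan, assignment


# Priority list (1, 1, 2) on M = 2 processors: LS yields makespan 3, whereas running
# the long task alone on one processor gives the optimum 2, a ratio of 3/2 = 2 - 1/M.
print(list_schedule([1, 1, 2], 2))
```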

Another example is to form a priority list according to nonincreasing order of
processing times $t_i$. The list scheduling algorithm applied to such an ordering is called
the largest processing time (LPT) algorithm. The LPT algorithm generates a solution
which satisfies:

    $C_{\max}(LPT) \le \left( \frac{4}{3} - \frac{1}{3M} \right) C^*_{\max}$ .   (2-9)

This means that the completion time of an LPT schedule is at most 33% worse than
that of the optimal solution.
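
The LPT bound can be checked on a small instance by comparing against a brute-force optimum. The sketch below is illustrative only: the exhaustive search enumerates all $M^N$ assignments and is usable only for tiny problems, and the function names are hypothetical.

```python
from itertools import product


def lpt_makespan(times, M):
    """List scheduling with tasks sorted by nonincreasing processing time."""
    loads = [0] * M
    for t in sorted(times, reverse=True):
        q = loads.index(min(loads))      # least-loaded processor (lowest index on ties)
        loads[q] += t
    return max(loads)


def optimal_makespan(times, M):
    """Exhaustive search over all M**N assignments (tiny instances only)."""
    best = sum(times)
    for assign in product(range(M), repeat=len(times)):
        loads = [0] * M
        for t, q in zip(times, assign):
            loads[q] += t
        best = min(best, max(loads))
    return best


# For times (3, 3, 2, 2, 2) on M = 2 processors LPT gives 7 and the optimum is 6,
# so the ratio 7/6 equals the bound 4/3 - 1/(3M) exactly.
times, M = [3, 3, 2, 2, 2], 2
print(lpt_makespan(times, M), optimal_makespan(times, M))
```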

In spite of these worst-case bounds, the LPT rule behaves very well on random
problems. In one experiment [Coffman, Garey, and Johnson, 1978], thirty tasks, with task
times chosen according to a uniform distribution between 0 and 1, were generated.
The optimal completion time $C^*_{\max}$ was estimated, and $C_{\max}(LPT)$ was taken as the
length of the LPT schedule generated. The experiment was repeated ten times and the
values of the relative approximation error,
$\left( C_{\max}(LPT) - C^*_{\max} \right) / C^*_{\max}$, were computed. The average error was found to be
0.074. In a second experiment [Coffman, Garey, and Johnson, 1978], task times were
chosen according to a normal distribution. The average value of the relative
approximation error was found to be 0.023. Indeed, it has been proved that if the processing
requirements of the tasks are identically distributed random variables, then as the
number of tasks, N, approaches infinity, the relative approximation error approaches zero
(asymptotic optimality) with probability 1 [Loulou, 1984; Bruno and Downey, 1986;
Frenk and Rinnooy Kan, 1987].

2.2.2  SCHEDULING  TREE  TASK  STRUCTURES 

From the preceding section, it is clear that the problem of minimizing makespan on $M$ processors (where $M \ge 2$) belongs to a class of difficult combinatorial problems. Nevertheless, in some cases, imposing restrictions on the problem simplifies it. The problem is trivial if all tasks have unit processing time: the LS strategy is optimal. By introducing precedence relations among tasks, nontrivial problems can be formulated. Suppose we have $N$ unit processing time tasks with precedence relations in the form of a tree, which must be scheduled on $M$ identical processors. Hu [1961] and Sethi [1976] proposed an $O(N)$ optimal algorithm to solve this type of problem. Hu's idea was based on an LS strategy, where the priority list is formed based on node levels. The level of a node $i$ in the task tree is defined as the number of nodes (including node $i$) on the path to the terminal node. A priority list is constructed according to nonincreasing node levels, and the LS scheduling algorithm is then applied using the priority list.

For example, Fig. 2-2 shows a tree task structure with unit processing time on each task. Task a should be assigned first since it has the highest level (i.e., 4). The next task to be assigned is task c with its level equal to 3, and so on. Fig. 2-3 shows the results of scheduling this tree task structure on a two-processor system. Since the tasks are arranged in nonincreasing levels, we can also apply Hu's algorithm, after minor modifications, to problems with nonidentical task processing times and directed acyclic task graphs [Kohler, 1975]. The modification pertains to the definition of a node level. In this case, the level of a node $i$ is the length of the longest path from node $i$ to the terminal node, where the length of the path is measured in terms of processing time. Therefore, the task to be scheduled next is the one that heads the current longest (critical) path in the precedence graph. Scheduling according to this rule is termed critical path (CP) scheduling: it is list scheduling applied to a list of tasks arranged in nonincreasing order of longest path length.
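
The level rule for unit-time tree tasks can be sketched as follows. The tree used here is hypothetical (it is not the tree of Fig. 2-2, which is not reproduced in this text), and the function and variable names are illustrative assumptions.

```python
def hu_level_schedule(succ, m):
    """Level (Hu) list scheduling of unit-time tasks on m processors.

    succ maps each task to its single successor in the tree
    (None for the terminal task). Returns a list of time slots.
    """
    level = {}

    def node_level(t):
        # number of nodes on the path from t to the terminal node, t included
        if t not in level:
            level[t] = 1 if succ[t] is None else 1 + node_level(succ[t])
        return level[t]

    for t in succ:
        node_level(t)

    pending = {t: 0 for t in succ}           # unfinished predecessors per task
    for s in succ.values():
        if s is not None:
            pending[s] += 1

    priority = sorted(succ, key=lambda t: -level[t])   # nonincreasing levels
    done, schedule = set(), []
    while len(done) < len(succ):
        ready = [t for t in priority if t not in done and pending[t] == 0]
        slot = ready[:m]                     # LS: highest-priority ready tasks
        for t in slot:
            done.add(t)
            if succ[t] is not None:
                pending[succ[t]] -= 1        # successor may become ready next slot
        schedule.append(slot)
    return schedule

# Hypothetical tree: a -> c -> f -> h, d -> f, e -> g -> h
tree = {"a": "c", "c": "f", "d": "f", "e": "g", "f": "h", "g": "h", "h": None}
schedule = hu_level_schedule(tree, 2)
```

For this tree the schedule has four unit-time slots, which equals both the length of the longest path and the bound of seven tasks on two processors rounded up, so no shorter schedule exists.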


FIGURE 2-2: A TREE-STRUCTURED TASK SYSTEM


FIGURE 2-3: LEVEL SCHEDULING ON TWO PROCESSORS


2.2.3  PREVIOUS  MAPPING  APPROACHES 

The mapping problem is similar to the previous scheduling problems except that precedence relationships may exist among tasks. In addition, the data dependencies among tasks induce a communication pattern among processors over limited capacity channels (links). Several approaches with different optimality criteria have been suggested to solve this type of problem: (1) network flows, (2) integer programming, (3) branch-and-bound, and (4) heuristic methods.


1. Network Flows

Stone [1977] studied a two-processor mapping problem assuming no precedence restrictions on task execution. He modified the intertask-communication graph by adding the two processor nodes to the task graph, with an edge between each processor and each task of the original task graph. The weight on the new edge between a task and a processor is the processing cost of the task on the other processor. A cut of the modified graph that separates the two processor nodes divides the graph into two disconnected parts, each part representing an allocation of tasks to a processor. The cost of the allocation is equal to the sum of the weights on the cut, and the minimum cut corresponds to the optimal mapping. Stone used a modified Ford-Fulkerson maxflow-mincut algorithm [Ford and Fulkerson, 1962] to minimize the sum of processing and communication costs. Since the precedence relationships among tasks are neglected in his model, the mapping $(T_1, T_2, \ldots, T_M)$ can be represented by the binary matrix $X = [x_{ik}]$, where $x_{ik} = 1$ if task $i$ is allocated to processor $k$ and $x_{ik} = 0$ otherwise. The cost function is then formulated as:

$$\mathrm{cost}(X) = \sum_{k=1}^{M} \sum_{i=1}^{N} f_{ik}\, x_{ik} + \sum_{l<k} \sum_{j<i} c_{ij}\, x_{ik}\, x_{jl} \qquad (2\text{-}10)$$

where $f_{ik}$ represents the processing cost for task $i$ on processor $k$, and $c_{ij}$ is the communication cost between tasks $i$ and $j$ when those tasks are assigned to different processors. The first summation represents the processing costs, while the second term represents the interprocessor communication cost between the two processors. Although the approach is very elegant, it has several shortcomings. First, the approach is limited to two-processor systems. For a system with more than two processors, the computational complexity soon becomes intractable. The second limitation of the approach is the difficulty of incorporating precedence relationships among tasks, and various constraints such as memory and redundancy, into the model.
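
The min-cut construction can be illustrated with a small max-flow routine. The sketch below uses a plain Edmonds-Karp search rather than Stone's exact procedure; the function name, argument layout, and the three-task instance are assumptions made for illustration only.

```python
from collections import deque

def min_cut_assignment(f1, f2, comm):
    """Two-processor assignment by min cut (Edmonds-Karp max flow).

    f1[i], f2[i]: processing cost of task i on processor 1 / processor 2.
    comm[(i, j)]: communication cost paid if tasks i and j are separated.
    Returns (total cost, set of tasks placed on processor 1).
    """
    n = len(f1)
    s, t = n, n + 1                      # processor 1 = source, processor 2 = sink
    cap = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[s][i] = f2[i]                # cut when task i ends up on processor 2
        cap[i][t] = f1[i]                # cut when task i ends up on processor 1
    for (i, j), c in comm.items():
        cap[i][j] += c
        cap[j][i] += c

    def reachable():
        """Breadth-first search in the residual graph, returning parent links."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if cap[u][v] > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if t not in parent:
            break                        # no augmenting path: flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push
    on_p1 = {v for v in parent if v < n} # tasks on the source side of the cut
    return flow, on_p1

cost, on_p1 = min_cut_assignment([5, 10, 4], [10, 3, 4], {(0, 1): 2, (1, 2): 6})
```

In this instance the cheapest allocation keeps task 0 on processor 1 and moves tasks 1 and 2 to processor 2, for a total cost of 5 + 3 + 4 + 2 = 14, which an exhaustive check of the eight possible allocations confirms.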

2.  Integer  Programming  Approach 

This method formulates the mapping problem as a 0-1 integer programming problem. The processing costs of tasks are represented by an $N \times M$ matrix $F$, where $f_{ij}$ represents the processing cost when task $i$ is allocated to processor $j$. The interprocessor communication cost can be represented by the product of data volume and distance. If we define $V = \{v_{ij}\}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N$, where $v_{ij}$ represents the amount of data to be transferred between tasks $i$ and $j$, and $C = \{c_{kl}\}$, $k = 1, 2, \ldots, M$ and $l = 1, 2, \ldots, M$, is the distance matrix where $c_{kl}$ is a measure of communication cost between processors $k$ and $l$, then the communication cost between a task $i$ allocated to processor $k$ and a task $j$ allocated to processor $l$ is $v_{ij} c_{kl}$. Neglecting the precedence constraints, we can formulate the objective function in terms of the allocation (assignment) matrix $X = [x_{ik}]$ as:

$$\mathrm{cost}(X) = \sum_{k=1}^{M} \sum_{i=1}^{N} f_{ik}\, x_{ik} + \sum_{l<k} \sum_{j<i} w\, v_{ij}\, c_{kl}\, x_{ik}\, x_{jl} \qquad (2\text{-}11)$$

The first summation term represents the processing cost for each task on its assigned processor. The second term represents the sum of interprocessor communication costs. The normalization constant $w$ is used to scale processing and interprocessor communication costs and to account for any differences in units. In this approach, constraints on memory and real-time processing can be easily added. A limited memory environment can be represented by

$$\sum_{i=1}^{N} m_i\, x_{ik} \le R_k, \qquad k = 1, 2, \ldots, M, \qquad (2\text{-}12)$$

where $m_i$ represents the amount of memory required by task $i$ and $R_k$ represents the memory capacity at processor $k$. The real-time constraint can be represented by

$$\sum_{i=1}^{N} t_{ik}\, x_{ik} \le d_k, \qquad k = 1, 2, \ldots, M, \qquad (2\text{-}13)$$

where $t_{ik}$ represents the processing time of task $i$ on processor $k$ and $d_k$ represents the time limit for processing the tasks that reside in processor $k$. The equation states that the time required to complete all the tasks assigned to a processor must not exceed the time limit. The integer programming approach provides a good representation of the task allocation environment. Constraints can be easily added or changed in the model, which makes it an appropriate representation of the application algorithms. The approach is usually limited to small mapping problems, since the computational and memory requirements grow exponentially with the problem size. In addition, the approach ignores queueing delays at the communication links, i.e., whenever data communication needs to take place between two processors, a path between the two processors is assumed to be free.
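
To make the objective and constraints concrete, the following sketch evaluates a given assignment rather than solving the integer program. It assumes a symmetric distance matrix with zero diagonal and counts each pair of tasks once, in the spirit of Eq. (2-11); the function names and the small data set are illustrative assumptions.

```python
def mapping_cost(assign, F, V, C, w=1.0):
    """Processing plus weighted communication cost of an assignment.

    assign[i] is the processor of task i; F, V, C are the processing-cost,
    data-volume, and distance matrices described in the text.
    """
    n = len(assign)
    processing = sum(F[i][assign[i]] for i in range(n))
    # each unordered pair of tasks is counted once (assumption: C symmetric)
    communication = sum(V[i][j] * C[assign[i]][assign[j]]
                        for i in range(n) for j in range(i))
    return processing + w * communication

def is_feasible(assign, mem, R, t, d):
    """Check the memory (2-12) and real-time (2-13) constraints per processor."""
    for k in range(len(R)):
        tasks = [i for i, p in enumerate(assign) if p == k]
        if sum(mem[i] for i in tasks) > R[k]:
            return False                 # memory capacity exceeded
        if sum(t[i][k] for i in tasks) > d[k]:
            return False                 # time limit exceeded
    return True

F = [[4, 6], [5, 3], [2, 2]]             # processing cost of task i on processor k
V = [[0, 3, 0], [3, 0, 1], [0, 1, 0]]    # data volume between tasks
C = [[0, 2], [2, 0]]                     # distance between processors
assign = [0, 1, 1]                       # task 0 on processor 0, tasks 1 and 2 on processor 1
total = mapping_cost(assign, F, V, C)
```

Here the processing cost is 4 + 3 + 2 = 9 and only the pair (0, 1) is split across processors, adding 3 x 2 = 6, so the total is 15 with $w = 1$.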

3.  Branch-and-bound 

Shen and Tsai [1985] employ a branch-and-bound method using the minimax criterion to select an optimal task assignment. The task assignment problem has been shown to be isomorphic to a graph matching problem termed weak homomorphism. The search for an optimal weak homomorphism corresponding to an optimal task assignment is carried out via a state space search. Due to weak homomorphism, neighboring tasks are always assigned to neighboring processors. Their approach is a heuristic search similar to the A* algorithm, and the objective function is more realistic than the previous ones. However, precedence relationships among tasks are neglected in their problem formulation.

4.  Heuristic  Method 

Heuristic approaches provide fast and effective means to obtain suboptimal mapping solutions. Kasahara and Narita [1984] propose a heuristic method termed critical path/most immediate successors first (CP/MISF) to minimize the completion time of the terminal task. The level $l_i$ of task $i$ used in the CP/MISF method is defined to be the longest path length from node $i$ to the terminal node. Their approach consists of the following three steps:

a. Determine the level of each task.

b. Form a priority list in nonincreasing order of node levels and, among tasks of equal level, of the number of immediate successor tasks.

c. Employ the LS strategy on the priority list.

The CP/MISF method is an improved version of critical path scheduling. From the previous discussion on scheduling independent and tree structured tasks, the worst case performance of this algorithm is given by the following equation.

$$\frac{CT_{\max} - CT^*_{\max}}{CT^*_{\max}} \le 1 - \frac{1}{M} \qquad (2\text{-}14)$$

where $CT_{\max}$ is the completion time with the CP/MISF method and $CT^*_{\max}$ is the optimal completion time. After the CP/MISF method is applied to find a solution, another method, termed depth first/implicit heuristic search (DF/IHS), is employed to improve on it, using the completion time of the CP/MISF solution as an upper bound. In their approach, communication overhead among tasks is neglected.


Bokhari [1981A] formulated the mapping problem as one of maximizing the cardinality of the mapping, i.e., the number of edges of the task graph that fall on the links in the processor graph. The heuristic algorithm proposed by Bokhari involves the iterative interchange of mapped nodes to increase the cardinality of the mapping at each iteration. There are several drawbacks to this approach. The processors and links are assumed to be identical. In addition, the computation is assumed to be regular, i.e., all tasks have identical processing times and the amount of data to be transmitted between a pair of tasks is the same. Later, Bokhari [1981B] employed a dynamic programming method to minimize the sum of execution and interprocessor communication costs for tree task structures. If the cost function involves time, this approach will minimize the serial execution time, i.e., at most one processor is active at any time.
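
The notion of cardinality, and the idea of improving it by interchanges, can be sketched as follows. This is a plain greedy pairwise interchange written for illustration, not Bokhari's exact procedure; the function names and the four-node ring instance are assumptions.

```python
def cardinality(task_edges, proc_links, placement):
    """Number of task-graph edges whose endpoints sit on directly linked processors."""
    links = {frozenset(link) for link in proc_links}
    return sum(1 for a, b in task_edges
               if frozenset((placement[a], placement[b])) in links)

def pairwise_improve(task_edges, proc_links, placement):
    """Swap the processors of two tasks for as long as a swap raises the cardinality."""
    placement = dict(placement)
    best = cardinality(task_edges, proc_links, placement)
    improved = True
    while improved:
        improved = False
        tasks = list(placement)
        for x in range(len(tasks)):
            for y in range(x + 1, len(tasks)):
                a, b = tasks[x], tasks[y]
                placement[a], placement[b] = placement[b], placement[a]
                score = cardinality(task_edges, proc_links, placement)
                if score > best:
                    best, improved = score, True      # keep the swap
                else:
                    placement[a], placement[b] = placement[b], placement[a]  # undo
    return placement, best

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]   # used for both task edges and processor links
start = {0: 0, 1: 2, 2: 1, 3: 3}          # task -> processor, one task per processor
final, score = pairwise_improve(ring, ring, start)
```

The starting placement maps two of the four task edges onto processor links; a single interchange of tasks 0 and 3 raises the cardinality to four, the maximum for this instance.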

Lee and Agarwal [1987] suggest a new objective function which takes the weight of the task edges and the actual system distance into account to minimize the communication overhead. The basic idea of their mapping strategy is that tasks that communicate more frequently than others are placed closer together. In their problem formulation, the objective function quantifies the real communication among tasks more accurately, but the computation time of tasks is neglected. Furthermore, a system node can accommodate at most one task, which means that the number of processors must be greater than or equal to the number of tasks.

2.3  DRAWBACKS  OF  THE  PREVIOUS  MAPPING  APPROACHES 

The previous mapping approaches, although varied in terms of the solution techniques used, have two major shortcomings. First, optimality criteria involving the sum of processing and communication costs neglect the synchronization delays due to precedence constraints, and assume that the communication delay in transferring data between a pair of processors is independent of the data traffic between the processors. In queueing parlance, the constant delay assumption is tantamount to modeling each communication link as an infinite server [Lavenberg, 1983]. Second, the approaches that take into account the precedence constraints assume zero communication delays [Kasahara and Narita, 1984], and those that consider communication delays neglect synchronization delays [Shen and Tsai, 1985].

The mapping algorithms developed in this report overcome the drawbacks of previous approaches. The key features of our problem formulation include: (1) explicit consideration of precedence restrictions among tasks and, hence, the synchronization delays, (2) sequencing of data messages to account for queueing delays at communication links, and (3) the incorporation of storage, security, and fault-tolerance requirements. In the next section we develop both heuristic and optimal algorithms to solve the general mapping problem formulated in section 2.1.


SECTION  3 

MAPPING  ALGORITHMS 


3.1  KEY  MAPPING  EQUATION 

In the general mapping problem formulated in subsection 2.1, the completion time of a task $i$ on an assigned processor $q$ is a function of: (1) the service demand of task $i$ and the service rate of processor $q$; (2) the time at which the data from immediate predecessor (or parent) tasks of task $i$ is available at processor $q$; and (3) the time when processor $q$ becomes available. The delay due to data dependency accounts for synchronization requirements and communication delays. The consideration of processor availability serves as a model for queueing delays due to previously assigned tasks. In this subsection, we derive an explicit equation for the evolution of the completion time as a function of the mapping, which forms the basis for our mapping algorithms of sections 3.2-3.4. For ease of exposition, we derive the mapping equation for the unconstrained and constrained problems separately. The mathematical notation used in this section is shown in table 3-1.

3.1.1  Mapping  Equation  without  Constraints 

Consider the mapping problem without memory, security, and redundancy constraints. Let $(T_1, T_2, \ldots, T_q, \ldots, T_M)$ denote a partial mapping and let $CT_j(T_1, T_2, \ldots, T_M)$ be the corresponding completion time of task $j$, $j \in V_t$. Since there are no redundancy constraints, the ordered sets $T_q$ are disjoint. Suppose we want to assign a ready task $i$ to processor $q$ so that the new mapping is $(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M)$. We would like to derive an expression for the completion time of task $i$ on processor $q$, $CT_i(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M)$, given the partial mapping and the corresponding completion times.

term                 meaning

$V_t$                set of tasks in the task graph
$V_s$                set of processors in the processor graph
$M$                  number of processors $= |V_s|$
$N$                  number of tasks $= |V_t|$
$s_i$                number of instructions of task $i$ in millions
$m_i$                memory requirement of task $i$ in Kb
$r_i$                redundancy of task $i$
$v_{ij}$             amount of data transferred between tasks $i$ and $j$ in K bytes
$\beta_i$            set of immediate predecessor (parent) tasks of task $i$
$\sigma_i$           set of immediate successors of task $i$
$\mu_q$              service rate of processor $q$ in millions of instructions per second (MIPS)
$R_q$                memory capacity of processor $q$ in Kb
$c_{pq}$             capacity of link $(p,q)$
$S_i$                set of processors where task $i$ can be assigned due to security constraints, $i \in V_t$
$T_q$                set of tasks assigned to processor $q$, $q \in V_s$
$P_i$                set of distinct processors to which task $i$ is assigned, $i \in V_t$
$CT_{i,a}$           completion time of the $a$-th copy of task $i$, $a = 1, 2, \ldots, r_i$; $i \in V_t$
$CT_i$               completion time of task $i$
$l$                  last task assigned to processor $q$
$A_{pq}$             the available time of link $(p,q)$
$i[y_j, a]$          $j$-th earliest completed parent of task $i$; task $y_j$ is the corresponding parent task,
                     and $a$ is the copy of $y_j$; $y_j \in \beta_i$ and $a = 1, 2, \ldots, r_{y_j}$, $i \in V_t$
$B^q_{i[y_j,a]}$     the time at which data from the $j$-th earliest parent of task $i$, denoted by
                     $i[y_j, a]$, is available at processor $q$, $i \in V_t$, $q \in V_s$
$i[j]$               $j$-th earliest completed parent of task $i$, $j = 1, 2, \ldots, |\beta_i|$,
                     when redundancy $r_i = 1$, $i \in V_t$
$B^q_{i[j]}$         the time at which data from the $j$-th earliest completed parent of task $i$
                     is available at processor $q$ when redundancy $r_i = 1$, $i \in V_t$, $q \in V_s$
$M_i$                set of processors to which task $i$ can be assigned without violating the memory constraint
$D_i^q$              latest time at which data from all the parents of task $i$ is available at processor $q$

TABLE 3-1: DEFINITION OF VARIABLES

As discussed earlier, task $i$ cannot begin execution on processor $q$ until the data from the predecessors of task $i$ is available at processor $q$, and until processor $q$ completes the execution of the end task in the ordered set $T_q$. Let $D_i^q$ denote the latest time at which data from all the predecessors of task $i$ becomes available at processor $q$, and let $l$ be the end task in the ordered set $T_q$. Then, the start time of task $i$ is $\max[D_i^q,\ CT_l(T_1, T_2, \ldots, T_q, \ldots, T_M)]$ and the completion time is given by

$$CT_i(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M) = \max\left[ D_i^q,\ CT_l(T_1, T_2, \ldots, T_q, \ldots, T_M) \right] + \frac{s_i}{\mu_q} \qquad (3\text{-}1)$$

The latest time at which data from all the predecessors of task $i$ becomes available at processor $q$ is:

$$D_i^q = \max_{j = 1, 2, \ldots, |\beta_i|} B^q_{i[j]} \qquad (3\text{-}2)$$

where $B^q_{i[j]}$ is the time at which data from the $j$-th earliest completed parent task of task $i$ becomes available at processor $q$ and $\beta_i$ is the set of parents of task $i$. To compute $B^q_{i[j]}$, we make the reasonable assumption that the data communication from parent tasks takes place in the order in which they are completed, i.e., the first completed parent task sends its data first. The variables $B^q_{i[j]}$ can be computed via the following algorithm.


Algorithm 3.1: Computation of $B^q_{i[j]}$. Given the completion times of the parent tasks of a task $i$, $CT_{i[j]}$, $1 \le j \le |\beta_i|$, arranged in nondecreasing order, and the link available times $A_{pq}$, the following algorithm computes $B^q_{i[j]}$.

For $j = 1$ to $|\beta_i|$ do
    $k_0 := \{\, p : i[j] \in T_p \,\}$
    If $k_0 \ne q$ then
        Find the shortest path $(k_0, k_1, \ldots, k_n, q)$ from processor $k_0$ to processor $q$
        $A_{k_0 k_1} := \max\{ CT_{i[j]},\ A_{k_0 k_1} \} + v_{i[j],i} / c_{k_0 k_1}$
        $k_{n+1} := q$
        For $l = 1$ to $n$ do
            $A_{k_l k_{l+1}} := \max( A_{k_{l-1} k_l},\ A_{k_l k_{l+1}} ) + v_{i[j],i} / c_{k_l k_{l+1}}$
        end do
        $B^q_{i[j]} := A_{k_n k_{n+1}}$
    else
        $B^q_{i[j]} := CT_{i[j]}$
    end if
end do
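
The store-and-forward transfer of Algorithm 3.1 and the completion-time rule of Eq. (3-1) can be sketched together. The sketch assumes that the route between processors is already known, that links are keyed by ordered processor pairs, and that the transfer time on a link is data volume divided by link capacity; all names are illustrative.

```python
def data_arrival_time(ct_parent, path, volume, link_free, capacity):
    """Time at which a parent's data reaches the last processor on `path`.

    path: processors (k0, k1, ..., q) on the route, assumed to be given.
    link_free[(p, r)]: time at which link (p, r) becomes available (updated in place).
    capacity[(p, r)]: capacity of link (p, r).
    """
    if len(path) == 1:               # parent ran on the same processor
        return ct_parent
    ready = ct_parent                # data cannot leave before the parent completes
    for hop in zip(path, path[1:]):
        start = max(ready, link_free.get(hop, 0.0))   # wait until the link is free
        ready = start + volume / capacity[hop]
        link_free[hop] = ready       # the link stays busy until this transfer ends
    return ready

def completion_time(parent_arrivals, proc_free, demand, rate):
    """Eq. (3-1): start once the data and the processor are both available."""
    latest_data = max(parent_arrivals, default=0.0)
    return max(latest_data, proc_free) + demand / rate

link_free = {(0, 1): 12.0}
capacity = {(0, 1): 4.0, (1, 2): 2.0}
arrival = data_arrival_time(10.0, (0, 1, 2), 8.0, link_free, capacity)
finish = completion_time([arrival, 9.0], proc_free=20.0, demand=30.0, rate=3.0)
```

In this example the parent finishes at time 10 but the first link is busy until 12, so the data reaches processor 2 at 12 + 2 + 4 = 18. The processor itself is busy until 20, so the task starts at 20 and completes at 30.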

3.1.2  Mapping  Equation  with  Constraints 

The memory and security constraints simply restrict the feasible processor assignments for a task. However, the redundancy constraint imposes additional synchronization delays, since a task $i$ cannot begin execution until the data from all copies of the parent tasks of task $i$ is available at the assigned processor. Therefore, we must keep track of the completion time of each copy of a task. In addition, the ordered sets $T_q$ in $(T_1, T_2, \ldots, T_q, \ldots, T_M)$ are not disjoint, since a task $i$ must be replicated at multiple processors.

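To derive the mapping equation, let $M_i$ denote the set of processors to which a ready task $i$ can be assigned without violating the memory constraints, i.e.,

$$M_i = \Big\{\, q \in V_s : m_i + \sum_{j \in T_q} m_j \le R_q \,\Big\} \qquad (3\text{-}3)$$
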
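and let $S_i$ be the set of processors to which task $i$ can be assigned based on security considerations. Then the set of feasible processor assignments, $F_p$, is given by

$$F_p = \{\, q : q \in M_i \cap S_i \,\} \qquad (3\text{-}4)$$

The completion time of (any copy of) task $i$ on processor $q$, $q \in F_p$, is given by:

$$CT_{i,a}(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M) = CT_i(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M) = \max\left( D_i^q,\ CT_l(T_1, T_2, \ldots, T_q, \ldots, T_M) \right) + \frac{s_i}{\mu_q} \qquad (3\text{-}5)$$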

where $D_i^q$ should be interpreted as the latest time at which data from all copies of the parent set $\beta_i$ becomes available at processor $q$. If we let $i[y_j, a]$ denote the $j$-th earliest completed parent task of task $i$, where $y_j$ is the identity of the parent task and $a$ is its copy ($1 \le a \le r_{y_j}$), then $D_i^q$ is given by:

$$D_i^q = \max_{y_j \in \beta_i}\ \max_{1 \le a \le r_{y_j}} B^q_{i[y_j, a]} \qquad (3\text{-}6)$$

where $B^q_{i[y_j, a]}$ is the time at which data from copy $a$ of task $y_j$ becomes available at processor $q$. For a feasible assignment of task $i$ on processor $q$, the variables $B^q_{i[y_j, a]}$ can be computed as follows:

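Algorithm 3.2: Computation of $B^q_{i[y_j, a]}$. Given the completion times of each copy of the parent tasks, $CT_{y_j, a}$, arranged in nondecreasing order, and the link available times $A_{pq}$, the following algorithm computes $B^q_{i[y_j, a]}$.

For each $(y_j, a)$, $y_j \in \beta_i$ and $a \le r_{y_j}$, do
    $k_0 :=$ assigned processor of the $a$-th copy of task $y_j$
    If $k_0 \ne q$ then
        Find the shortest path $(k_0, k_1, \ldots, k_n, q)$ from processor $k_0$ to processor $q$
        $A_{k_0 k_1} := \max\{ A_{k_0 k_1},\ CT_{y_j, a} \} + v_{y_j, i} / c_{k_0 k_1}$
        $k_{n+1} := q$
        For $l = 1$ to $n$ do
            $A_{k_l k_{l+1}} := \max( A_{k_{l-1} k_l},\ A_{k_l k_{l+1}} ) + v_{y_j, i} / c_{k_l k_{l+1}}$
        end do
        $B^q_{i[y_j, a]} := A_{k_n k_{n+1}}$
    else
        $B^q_{i[y_j, a]} := CT_{y_j, a}$
    end if
end do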


3.2  HEURISTIC  MAPPING  ALGORITHM 

The heuristic algorithm consists of two stages. The first stage employs the concept of critical path to determine the order of task execution, while the second stage sequentially allocates the tasks from the ordered list to processors so that the completion time of the tasks is a minimum. To determine the order of task execution, we define the level of a task $i$ as:

$$l_i = \max_{k} \sum_{j \in \pi_k} \frac{s_j}{\mu_f} \qquad (3\text{-}7)$$

where $\mu_f = \max_{q \in V_s} \mu_q$ is the service rate of the fastest processor, $s_j$ is the service demand of task $j$, and $\pi_k$ denotes the $k$-th path from task $i$ to the terminal task. That is, $l_i$ is the length of the critical path from task $i$ to the terminal task. By construction, $l_i$ is a lower bound on the completion time of a task graph rooted at task $i$. Following the level algorithm of Hu [1961] and Sethi [1976] discussed in section 2.2, the heuristic algorithm is based on the premise that tasks with larger levels should be executed earlier in the sequence. If several tasks have the same level, then the task with the greater number of successors should be completed first. Thus, we construct the execution order according to nonincreasing levels first, and nonincreasing number of successors next when tasks have the same level.
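
The level of Eq. (3-7) and the resulting priority list can be computed recursively, as in the sketch below. It assumes that ties are broken by the number of immediate successors (the text says only "successors"), and the four-task graph is hypothetical; the function name is illustrative.

```python
from functools import lru_cache

def priority_list(succ, demand, rates):
    """Order tasks by nonincreasing level, breaking ties by more immediate successors.

    succ[i]: tuple of immediate successors of task i.
    demand[i]: service demand of task i; rates: processor service rates.
    """
    mu_f = max(rates)                        # service rate of the fastest processor

    @lru_cache(maxsize=None)
    def level(i):
        # longest remaining path, measured in time on the fastest processor
        tail = max((level(j) for j in succ[i]), default=0.0)
        return tail + demand[i] / mu_f

    order = sorted(succ, key=lambda i: (-level(i), -len(succ[i])))
    return order, {i: level(i) for i in succ}

succ = {1: (2, 3), 2: (4,), 3: (4,), 4: ()}      # hypothetical diamond-shaped graph
demand = {1: 10, 2: 40, 3: 20, 4: 0}             # terminal task 4 is a dummy task
order, levels = priority_list(succ, demand, rates=[1.0, 2.0])
```

With the fastest rate equal to 2, the levels are 25, 20, 10, and 0 for tasks 1 to 4, so the priority list is simply 1, 2, 3, 4.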

Once the priority list is constructed, we sequentially allocate tasks to processors to minimize the completion time. In the case without constraints, task $i$ is assigned to processor $q^*$, where

$$q^* = \arg\min_{q}\ CT_i(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M) \qquad (3\text{-}8a)$$

In the constrained mapping problem, we assign task $i$ to the $r_i$ distinct processors $q_1^*, q_2^*, \ldots, q_{r_i}^*$ that yield the minimum completion times. That is, the assignments $q_k^*$ ($1 \le k \le r_i$) are such that:

$$CT_i(T_1, \ldots, T_{q_1^*} \cup \{i\}, \ldots, T_M) \le CT_i(T_1, \ldots, T_{q_2^*} \cup \{i\}, \ldots, T_M) \le \cdots \le CT_i(T_1, \ldots, T_{q_{r_i}^*} \cup \{i\}, \ldots, T_M) \qquad (3\text{-}8b)$$
The  heuristic  mapping  algorithm  proceeds  as  follows: 

Algorithm 3.3: Heuristic Mapping Algorithm. Given a directed task graph $G_t = (V_t, E_t)$ with parameters $(s_i, m_i, r_i, v_{ij})$ and the processor graph $G_s = (V_s, E_s)$ with parameters $(\mu_p, R_p, c_{pq})$, the heuristic algorithm computes the mapping $(T_1, T_2, \ldots, T_q, \ldots, T_M)$, where $T_q$ is the ordered set of tasks allocated to processor $q$.

Step 1: Determine the level of each task
        $Z := \{N\}$
        Repeat until $Z = \emptyset$
            Select a task $i$ of $Z$ such that no successors of task $i$ appear in $Z$
            Compute the level of task $i$, $l_i$, via: $l_i = \max_{j \in \sigma_i} (l_j) + s_i / \mu_f$
            $Z := (Z - \{i\}) \cup \beta_i$
        end

Step 2: Construct a priority list [1] [2] ... [N] by sorting $l_i$ in nonincreasing order.
        Break ties on the basis of number of successors.

Step 3: For $j = 1$ to $N$ do
            $i := [j]$
            Form the feasible set of processors $F_p$ via Eqs. (3-3) and (3-4)
            Find assignments $q_k^*$ ($1 \le k \le r_i$) via Eq. (3-8)
            $T_{q_k^*} := T_{q_k^*} \cup \{i\}$; $k = 1, 2, \ldots, r_i$
            Update the link available times via Algorithm 3.1 or 3.2 as appropriate
        end do

Illustrative  Example: 

Suppose the task and processor graphs are as shown in Fig. 3-1. We use step 1 to determine the levels of the tasks: $l_1 = 105$, $l_2 = 100$, $l_3 = 50$, $l_4 = 50$, $l_5 = 100$, $l_6 = 45$, $l_7 = 45$, $l_8 = 0$. The priority list is constructed in order of nonincreasing levels: [1]=1, [2]=2, [3]=5, [4]=4, [5]=3, [6]=6, [7]=7, and [8]=8. We construct the set of feasible processor assignments $F_p$ for task [1]=1 and obtain $F_p = \{q : q \in M_1 \cap S_1\} = \{1, 2\}$. Since task [1]=1 can only be assigned to processors 1 and 2, we select processor 2, which has the earlier completion time for task 1. Similarly, the set of feasible processor assignments for task [2]=2 is $F_p = \{1, 2, 3, 4\}$. Since task [2]=2 has a replication level of 2, we assign task 2 to two different processors, 2 and 4, with corresponding completion times of 60 and 70 time units. The set of feasible processor assignments $F_p$ for task [3]=5 is {2, 3, 4}. Task [3]=5 also has a replication level of 2, so we assign task 5 to two different processors, 2 and 1, with corresponding completion times of 115 and 116 time units. $F_p$ for task [4]=4 is {1}, so we assign task [4]=4 to processor 1 due to the security constraint. Tasks [5] to [8] are assigned in the same way. The completion time of the terminal task, task 8, is 216 for this example.


FIGURE 3-1: TASK AND PROCESSOR GRAPHS FOR AN ILLUSTRATIVE EXAMPLE

3.3 PAIR-WISE EXCHANGE ALGORITHM

The  performance  of  the  heuristic  algorithm  can  be  improved  by  iteratively 
exchanging  the  order  of  elements  in  the  priority  list,  while  satisfying  the  precedence 
constraints.  The  following  lemma  defines  a  feasible  exchange  of  tasks  in  a  priority  list. 

Lemma  3.1: 

If [1] [2]..[j]..[k]..[N] is a priority list, then [1] [2]..[k]..[j]..[N] is also a feasible priority list only if all the successors of task [j] are executed after task [k] and all the parents of task [k] are executed before task [j], i.e., $\sigma_{[j]} \subset \{[k+1], [k+2], \ldots, [N]\}$ and $\beta_{[k]} \subset \{[1], [2], \ldots, [j-1]\}$.

Proof: 

We will prove this by contradiction. Since [1] [2]..[j]..[k]..[N] is a feasible priority list, all the predecessors of task [j] must be executed before task [j] and all the successors of task [k] must be executed after task [k], i.e., $\beta_{[j]} \subset \{[1], [2], \ldots, [j-1]\}$ and $\sigma_{[k]} \subset \{[k+1], [k+2], \ldots, [N]\}$. If any successor of task [j] or any predecessor of task [k] is executed between tasks [j] and [k], then an exchange of the order of execution of tasks [j] and [k] will require the execution of at least one task prior to its predecessor. That is, the new priority list will violate the precedence constraint, completing the proof.
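
The exchange condition of Lemma 3.1 translates directly into a membership test. The sketch below uses 0-based positions and a hypothetical diamond-shaped task graph; the function name and data layout are illustrative assumptions.

```python
def exchangeable(order, j, k, parents, successors):
    """Check the condition of Lemma 3.1 for positions j < k (0-based) in `order`."""
    task_j, task_k = order[j], order[k]
    up_to_k = set(order[j + 1:k + 1])    # tasks in positions j+1 .. k
    from_j = set(order[j:k])             # tasks in positions j .. k-1
    # every successor of [j] must come after position k, and
    # every parent of [k] must come before position j
    return (not (set(successors[task_j]) & up_to_k)
            and not (set(parents[task_k]) & from_j))

parents = {1: [], 2: [1], 3: [1], 4: [2, 3]}
successors = {1: [2, 3], 2: [4], 3: [4], 4: []}
order = [1, 2, 3, 4]
```

For this graph, tasks 2 and 3 (positions 1 and 2) may be exchanged because neither depends on the other, whereas task 1 cannot be exchanged with task 2, its own successor.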

Note that the start and terminal tasks (i.e., tasks [1] and [N]) cannot be exchanged with any task. The basic idea of pair-wise exchange is illustrated in Fig. 3-2. The pair-wise exchange algorithm considers all possible exchanges of the type illustrated in Fig. 3-2, and terminates when all feasible exchanges are exhausted. The algorithm proceeds as follows:


FIGURE 3-2: CONCEPT OF PAIR-WISE EXCHANGE ALGORITHM

Algorithm 3.4: Pair-wise Exchange Algorithm. Given a directed task graph $G_t = (V_t, E_t)$ with parameters $(s_i, m_i, r_i, v_{ij})$ and the processor graph $G_s = (V_s, E_s)$ with parameters $(\mu_p, R_p, c_{pq})$, the pair-wise exchange algorithm computes the mapping $(T_1, T_2, \ldots, T_q, \ldots, T_M)$, where $T_q$ is the ordered set of tasks allocated to processor $q$.

Step 1: Same as step 1 of algorithm 3.3
Step 2: Same as step 2 of algorithm 3.3
Step 3: For j = 2 to N-2 do
            flag = 0
            For k = j+1 to N-1 do
                If tasks [j] and [k] are exchangeable then
                    Initialize processor and link available times
                    Swap tasks [j] and [k] to construct a new priority list
                    Use step 3 of algorithm 3.3 to find a new mapping
                    If the result is better then
                        k' := k
                        flag = 1
                        Update the result
                    end if
                    Swap tasks [j] and [k]
                end if
            end do
            If flag = 1 then
                swap tasks [j] and [k']
            end if
        end do

In algorithm 3.4, the inner do loop finds the best task [k'] for the $j$-th position of the execution order, and the outer do loop controls the value of j and then exchanges the $j$-th position with the best task found, if it results in a better mapping. Since there exists a start task and a terminal task in the task graph, it is obvious that [1']=[1] and [N']=[N], and hence no other tasks are exchangeable with these two tasks.
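
The control flow of the pair-wise exchange can be sketched independently of the mapping details. In the sketch below, `evaluate` and `can_swap` are placeholders supplied by the caller: the first stands in for step 3 of algorithm 3.3 (returning a completion time), the second for the exchange test of Lemma 3.1. Both names, and the toy cost function in the usage lines, are assumptions for illustration.

```python
def pairwise_exchange(order, evaluate, can_swap):
    """Greedy improvement of a priority list in the manner of Algorithm 3.4.

    evaluate(order) -> completion time of the mapping built from `order`
    can_swap(order, j, k) -> True if positions j and k are exchangeable
    """
    order = list(order)
    best = evaluate(order)
    n = len(order)
    for j in range(1, n - 2):                # positions 2 .. N-2 (1-based)
        best_k = None
        for k in range(j + 1, n - 1):        # positions j+1 .. N-1 (1-based)
            if not can_swap(order, j, k):
                continue
            order[j], order[k] = order[k], order[j]      # trial swap
            cost = evaluate(order)
            if cost < best:
                best, best_k = cost, k                   # remember the best candidate
            order[j], order[k] = order[k], order[j]      # undo the trial swap
        if best_k is not None:
            order[j], order[best_k] = order[best_k], order[j]   # keep the best swap
    return order, best

# Toy usage: the cost is the total displacement from sorted order (not a real mapping cost)
toy_cost = lambda o: sum(abs(v - i) for i, v in enumerate(o))
result, cost = pairwise_exchange([0, 2, 1, 3, 4], toy_cost, lambda o, j, k: True)
```

As in the algorithm, the first and last positions are never touched, and each position is fixed once the best exchange for it has been applied.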

Illustrative  Example: 

Consider the task and processor graphs shown in Fig. 3-1. We construct the priority list [1]=1, [2]=2, [3]=5, [4]=4, [5]=3, [6]=6, [7]=7, [8]=8. Using the heuristic algorithm, we obtain a completion time of 216. Exchanging tasks [2] and [3] gives the new priority list [1]=1, [2]=5, [3]=2, [4]=4, [5]=3, [6]=6, [7]=7, [8]=8. Step 3 of algorithm 3.3 then yields a new mapping with completion time 205, so we keep this priority list and its corresponding mapping. We then try the other exchanges for the second position, such as tasks [2]=2 and [4]=4, and [2]=2 and [5]=3, and find that the completion time of the terminal task under these priority lists is not earlier than 205. Hence, we swap tasks [2] and [3], fix task [2]=5, and obtain the new priority list [1]=1, [2]=5, [3]=2, [4]=4, [5]=3, [6]=6, [7]=7, [8]=8. We continue by exchanging tasks [3]=2 and [4]=4, [3]=2 and [5]=3, and so on. Since we cannot find a completion time earlier than 205, the output is the priority list [1]=1, [2]=5, [3]=2, [4]=4, [5]=3, [6]=6, [7]=7, [8]=8, the corresponding mapping, and the completion time 205.

3.4 OPTIMAL MAPPING (A* AND A*ε) ALGORITHMS

Since the heuristic and pair-wise exchange algorithms do not guarantee an optimal
solution, we develop an optimal mapping (A*) algorithm, which forms a benchmark
against which to evaluate the performance of the two heuristic algorithms. The A*
algorithm employs the heuristic algorithm to find an upper bound (UB) on the completion
time and then searches for the optimal allocation among all possible combinations of
sequencing orders and allocations. The state space of the task allocation problem can
be conceptualized as a decision tree, wherein node n is parameterized by the 5-tuple
$(i_n, P_{i_n}, f_n, CT_{i_n}, \pi_n)$, where $i_n$ is the task at node n, $P_{i_n}$ is the
set of processors to which task $i_n$ is assigned, $f_n$ is the cost of a solution
constrained to go through node n, $CT_{i_n}$ is the maximum completion time of task $i_n$
at node n, and $\pi_n$ is the parent node of node n. The heuristic evaluation function
(HEF), also termed the cost selection function, $f_n$, used by the A* algorithm consists
of three parts, $g_n$, $h_n$, and $f_{\pi_n}$, where $g_n = CT_{i_n}$, $h_n$ is the
estimated cost of the unassigned tasks at node n, and $f_{\pi_n}$ is the cost of the
parent node of node n. Since a task graph may consist of several routes from the start
task to the terminal task, we may generate a node n whose task $i_n$ lies on one path
while the task $i_{\pi_n}$ of its predecessor node $\pi_n$ lies on another path. Once we
visit node n, we may use the HEF at node $\pi_n$ as a reference if the value of
$g_n + h_n$ is less than $f_{\pi_n}$. The reason is that the lower bound on the cost of
going through node $\pi_n$ is $f_{\pi_n}$, so $f_{\pi_n}$ is also a lower bound on the
cost at node n. As explained in section 3.1, the level $l_i$ of task i is a lower bound
on the completion time of a task graph rooted at task i. Since the processing time of
task $i_n$ on processor set $P_{i_n}$ has already been included in $CT_{i_n}$, the
estimated cost of the unassigned tasks is equal to the level of task $i_n$ minus
the service time of task $i_n$ on the fastest processor. In other words,

$h_n = l_{i_n} - s_{i_n} / \mu_f$

where $\mu_f$ is the service rate of the fastest processor.

The heuristic evaluation function (HEF, also termed cost selection function) $f_n$ is then
determined by:

$f_n = \max(g_n + h_n, f_{\pi_n})$     (3-10)

The HEF, $f_n$, defined in this way is admissible, i.e., it guarantees an optimal
solution. We start from the start node (root node, node 0) with its estimated cost $h_0$
evaluated to be:

$h_0 = \max\left( l_{[1]}, \; \sum_{i=1}^{N} s_i \Big/ \sum_{i=1}^{M} \mu_i \right)$     (3-11)

where $l_{[1]}$ is the level of the start task, $\sum_{i=1}^{N} s_i$ is the total service
demand of the tasks, and $\sum_{i=1}^{M} \mu_i$ is the service rate of the computer system.
The completion time is known to be lower bounded by the level of the start task. Since
the total service demand of the tasks, divided by the service rate of the computer
system, is another lower bound on the completion time, $h_0$ is taken as the larger of
these two values. The cost at node 0 is $f_0 = g_0 + h_0 = h_0$, since $g_0 = 0$.
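
The three bounds used by the HEF can be written down directly. The sketch below is only an illustration of equations 3-10 and 3-11 and of the cost-to-go described in the text; the function names and the numbers in the usage note are assumptions, not values from the report.

```python
def cost_to_go(level, service_demand, fastest_rate):
    """h_n: level of the task minus its service time on the fastest processor."""
    return level - service_demand / fastest_rate


def root_estimate(level_start, service_demands, service_rates):
    """h_0 (eq. 3-11): larger of the start-task level and total demand / total rate."""
    return max(level_start, sum(service_demands) / sum(service_rates))


def hef(g_n, h_n, f_parent):
    """f_n (eq. 3-10): never smaller than the parent's cost."""
    return max(g_n + h_n, f_parent)
```

For instance, with nine tasks whose demands sum to 48 on four unit-rate processors and a start-task level of 7, `root_estimate` returns 12, because the load-balancing bound dominates the critical-path bound.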

The ready task set at node 0 is the start task set {[1]}, where [1] represents the
start task. We construct the feasible processor assignment set $F_p$ for task [1] and then
expand node 0 by assigning the start task, task [1], to processor set $P_{[1]}$ for every
$P_{[1]} \subset F_p$, and evaluate the corresponding 5-tuple
$(i_n, P_{i_n}, f_n, CT_{i_n}, \pi_n)$ at each node n. If the value of the estimated cost at
node n, $f_n$, is less than or equal to the upper bound on the completion time obtained
from the heuristic algorithm, we retain node n in the decision tree. Otherwise, node n
is "fathomed". Once a node has been expanded, we put that node into a set CLOSED. We
continue by selecting for expansion the node n whose $f_n$ is a minimum among the
unexpanded nodes. The ready task set at node n can be constructed from the path from
node 0 to node n and the precedence relationships of the task graph. Once the ready
task set is constructed, each task (element) i in the ready task set at node n is again
assigned to processor set $P_i$ for every $P_i \subset F_p$. The number of new nodes
generated by assigning tasks to processors is

$\sum_{i \,\in\, \text{ready task set}} \binom{|F_p|}{r_i}$.

Among these new nodes, only those tuples whose $f_n$ is less than or equal to the upper
bound (UB) need to be retained. We repeat this procedure until we construct an empty
ready task set at some node, which is the goal node. Note that every node expansion
adds new nodes to the decision tree. A complete path is a path from node 0 to the goal
node. All other paths are incomplete. The priority list on each processor, the
completion times of the tasks, and their corresponding allocations can be read from the
optimal path. The completion time at the goal node is the completion time of the
optimal mapping. Let $T_q^m$ be the set of tasks assigned to processor q at node m, $Z_m$
be the ready task set at node m, and $a_m$ correspond to the processor at node m. The A*
algorithm is described in the following:

Algorithm 3.6: Optimal mapping (A*) algorithm. Given a directed task graph
$G_t = (V_t, E_t)$ with parameters $(s_i, m_i, r_i, v_{ij})$ and the processor graph
$G_p = (V_p, E_p)$ with parameters $(\mu_p, R_p, c_{pq})$, the A* algorithm computes the
optimal mapping $(T_1, T_2, \ldots, T_q, \ldots, T_M)$, where $T_q$ is the ordered set of
tasks allocated to processor q.

Step 1: Use algorithm 3.3 to find an upper bound (UB) on the completion time
Step 2: $C = \emptyset$
        n=0
        Z={[1]}
        Select a task i of Z such that no predecessors of task i appear in Z
        cost-to-go := $l_i - s_i / \mu_f$
        Form the set of feasible processor assignments $F_p$ for task i via eqns 3.3 and 3.4
        For every possible set $P_i$ such that $|P_i| = r_i$ and $P_i \subset F_p$ do
            For a=1 to $r_i$ do
                q = a-th element of set $P_i$
                Find $CT_{i,a}(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M)$ via equations 3-1, 3-2, and algorithm 3.1
            end do
            If $\max_{1 \le a \le r_i} CT_{i,a}(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M)$ + cost-to-go $\le$ UB then
                n=n+1
                f(n) = $\max\{ \max_{1 \le a \le r_i} CT_{i,a}(T_1, T_2, \ldots, T_q \cup \{i\}, \ldots, T_M)$ + cost-to-go, $(\sum_{i=1}^{N} s_i) / (\sum_{m=1}^{M} \mu_m) \}$
                $\pi_n = 0$
                $a_n = q$
            end if
        end do
        $Z_0 = Z - \{i\} \cup \sigma_i$

Step 3: $m = \arg\min_{1 \le i \le n,\ i \notin C} f(i)$
        If $\pi_m = 0$ then
            For q=1 to M do

            end do
        else
            For q=1 to M do

            end do
        end if
        $Z_m := Z_{\pi_m}$
        $C = C \cup \{m\}$

Step 4: If $Z_m = \emptyset$ output the result
        Select a task i from $Z_m$ such that no predecessors of task i appear in $Z_m$
        $Z_m = Z_m - \{i\} \cup \sigma_i$
        cost-to-go := $l_i - s_i / \mu_f$
        Form the set of feasible processor assignments $F_p$ for task i via eqns 3.3 and 3.4
        For every possible set $P_i$ such that $|P_i| = r_i$ and $P_i \subset F_p$ do
            For a=1 to $r_i$ do
                q = a-th element of set $P_i$
                Find $CT_{i,a}(T_1^m, T_2^m, \ldots, T_q^m \cup \{i\}, \ldots, T_M^m)$ via eqns 3-5, 3-6, and algorithm 3.2
            end do
            If $\max_{1 \le a \le r_i} \{ CT_{i,a}(T_1^m, T_2^m, \ldots, T_q^m \cup \{i\}, \ldots, T_M^m) \}$ + cost-to-go $\le$ UB then
                n=n+1
                f(n) = $\max\{ \max_{1 \le a \le r_i} \{ CT_{i,a}(T_1^m, T_2^m, \ldots, T_q^m \cup \{i\}, \ldots, T_M^m) \}$ + cost-to-go, $f(\pi_n) \}$
                $\pi_n = m$
                $T_q^n = T_q^m \cup \{i\}$
            end if
        end do
        If $Z_m = \emptyset$ then $C = C - \{m\}$
        Go to step 3

Illustrative Example:

Consider a simple example where the task and processor graphs are as shown in figure
3-3. We use the heuristic algorithm to find an upper bound, UB=15.5, on the completion
time, and then start the search process. The search steps are shown in figure 3-4.
If we assign a task to a processor where the HEF is greater than UB, the 5-tuple
of that node is not retained. Each time, the leaf node with the lowest cost is selected
for expansion. The number of processors assigned to a task at a node is equal to the
replication number of that task. The cost at a node is the maximum HEF among the costs
of the same task at different processors and the cost of its parent node. The complete
path from the root node to the goal node corresponds to a mapping.
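
The search can be sketched on a deliberately simplified model: independent tasks, identical unit-rate processors, and no communication, with a greedy longest-task-first schedule standing in for the heuristic algorithm that supplies UB. This is a sketch under those assumptions, not the report's algorithm 3.6. The function names are hypothetical, and the nine-task instance in the usage note is the textbook worst case for longest-task-first scheduling on four processors, not the example of figure 3-3.

```python
import heapq


def lpt_upper_bound(demands, m):
    """Greedy schedule (longest task first, least-loaded processor); returns its makespan."""
    loads = [0.0] * m
    for s in sorted(demands, reverse=True):
        loads[loads.index(min(loads))] += s
    return max(loads)


def astar_makespan(demands, m):
    """Best-first search over partial assignments, pruned by the upper bound."""
    tasks = sorted(demands, reverse=True)
    if not tasks:
        return 0.0
    ub = lpt_upper_bound(tasks, m)
    # Root estimate: larger of the longest task and the load-balancing bound.
    h0 = max(tasks[0], sum(tasks) / m)
    # Heap entries: (f, -number of tasks assigned, sorted processor loads).
    # The negative depth makes deeper nodes win ties on f.
    frontier = [(h0, 0, (0.0,) * m)]
    closed = set()
    while frontier:
        f, neg_k, loads = heapq.heappop(frontier)
        k = -neg_k
        if k == len(tasks):
            return f                      # first goal popped is optimal
        if (k, loads) in closed:
            continue
        closed.add((k, loads))
        for load in set(loads):           # identical processors: try each distinct load once
            new = list(loads)
            new[new.index(load)] = load + tasks[k]
            new_loads = tuple(sorted(new))
            f_child = max(max(new_loads), f)   # f_n = max(g_n, f of parent)
            if f_child <= ub:                  # otherwise the node is fathomed
                heapq.heappush(frontier, (f_child, -(k + 1), new_loads))
    return ub
```

On the demands `[7, 7, 6, 6, 5, 5, 4, 4, 4]` with four processors, the greedy bound is 15 and the search returns 12, the same pair of values that section 4.1 reports for its first example.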

FIGURE 3-3: AN ILLUSTRATIVE EXAMPLE FOR A* ALGORITHM

FIGURE 3-4: STATE SPACE SEARCH FOR A* ALGORITHM

Although the A* algorithm provides an optimal solution, the entire decision tree is
usually very large. For a problem where a large number of nodes must be generated
before an optimal solution can be determined, the A* algorithm cannot be used because
of the limitations of memory size and CPU time. Experience shows that in many problems
the A* algorithm spends a large amount of time discriminating among paths whose costs
do not vary significantly from each other. In such cases the admissibility property
becomes a curse rather than a virtue. It forces A* to spend a disproportionately long
time in selecting the best among roughly equal candidates and prevents A* from
completing the search with a suboptimal but otherwise acceptable solution. The A*ε
algorithm is similar to the A* algorithm and is developed to compensate for this
drawback. In the A*ε algorithm, ε is a nonnegative value which bounds the deviation of
the final solution from the optimal solution. In general, the larger the value of ε,
the less memory and CPU time is required to get the final solution. This value is used
to control the speed of getting a solution from the decision tree. The main difference
between the A* and the A*ε algorithms is the determination of the node expansion order.
In the A* algorithm, the leaf node selected for expansion must be a node with a minimum
HEF among the leaf nodes. In the A*ε algorithm, the leaf node selected for expansion is
a node whose HEF is within $(1+\epsilon) HEF_{min}$, where $HEF_{min}$ represents the
minimum HEF value among the leaf nodes. In fact, the A* algorithm is the special case
ε=0.
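
The relaxed selection rule can be sketched as follows. The text only requires the chosen leaf to have an HEF within $(1+\epsilon)$ times the minimum; which eligible leaf to take is not specified here, so the tie-break below (prefer the deepest eligible leaf) is an assumption made for illustration, as are the function name and the `(hef, depth)` representation of a leaf.

```python
def select_leaf(leaves, eps):
    """Pick a leaf whose HEF is within (1 + eps) * minimum HEF.

    leaves -- non-empty list of (hef, depth) pairs (hypothetical representation)
    eps    -- nonnegative tolerance; eps = 0 reduces to the A* rule
    """
    mincost = min(f for f, _ in leaves)
    eligible = [leaf for leaf in leaves if leaf[0] <= (1 + eps) * mincost]
    # Assumed tie-break: among eligible leaves, prefer the deepest one.
    return max(eligible, key=lambda leaf: leaf[1])
```

With leaves `[(12.0, 2), (12.1, 5), (13.0, 7)]`, ε=0 selects the minimum-cost leaf, while ε=0.01 admits the second leaf and ε=0.1 admits all three, so deeper leaves are expanded sooner as ε grows.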

Algorithm 3.7: A*ε Algorithm. Given a directed task graph $G_t = (V_t, E_t)$ with
parameters $(s_i, m_i, r_i, v_{ij})$ and the processor graph $G_p = (V_p, E_p)$ with
parameters $(\mu_p, R_p, c_{pq})$, the A*ε algorithm computes the ε-optimal mapping
$(T_1, T_2, \ldots, T_q, \ldots, T_M)$, where $T_q$ is the ordered set of tasks allocated to
processor q.

Step 1: Input ε

Step 2: Same as step 2 of algorithm 3.5

Step 3: $m = \arg\min_{1 \le i \le n,\ i \notin C} f(i)$
        mincost = $\min_{1 \le i \le n,\ i \notin C} f(i)$

Step 3.1: If we can find a node m such that $f_m \le (1+\epsilon) \times$ mincost then
              go to step 3.2
          else
              go to step 3
          end if

Step 3.2: If $\pi_m = 0$ then
              For q=1 to M do

T?  =  TnVlam) 

              end do
          else
              For q=1 to M do

              end do
          end if
          $C = C \cup \{m\}$

Step 4: Same as algorithm 3.5 except go to step 3.1 instead of going to step 3


3.5  SUMMARY 

In this section, we described the key mapping equations with and without constraints
and the four mapping algorithms. The effects of task queueing, precedence
relationships, data dependency among tasks, and message collisions are explicitly
considered in the key mapping equations. The heuristic algorithm employs the critical
path method to determine a priority list of execution order and then uses the one-step
optimization method to find a mapping. The pair-wise exchange algorithm improves the
solution of the heuristic algorithm by exchanging tasks in its priority list of
execution order. Since the heuristic and pair-wise exchange algorithms do not guarantee
an optimal solution, we develop an optimal mapping (A*) algorithm by considering all
possible combinations of task execution order and allocation. A node with a minimum HEF
is selected for expansion until a complete path consisting of all tasks is found. The
path corresponds to an optimal mapping. The performance of the A* algorithm is limited
by the required memory and CPU time. In order to reduce the CPU time and memory
requirements of the A* algorithm, the A*ε algorithm is developed to find a solution
whose cost does not exceed the optimal cost by more than a factor of 1+ε.
SECTION 4
COMPUTATIONAL EXPERIMENTS

4.1  EXPERIMENT  1:  HYPOTHETICAL  EXAMPLES 

The first example involves the mapping of independent tasks onto identical processors.
The task graph consists of isolated vertices and no edges. In order to apply the
mapping algorithms of section 3, a dummy start task, task 10, and a dummy terminal
task, task 11, are added to the graph, as shown in Fig. 4-1. The service demands of
these two dummy tasks are zero. Links between the dummy start task and the independent
parallel tasks are added to the task graph. Similarly, links between the dummy terminal
task and the independent parallel tasks are added. The amount of data transmitted
between the dummy nodes and the other nodes is zero. The processor graph is a 2-cube
computer system with unit service rate and unit link capacity. The levels of the tasks
are the same as their service demands in this example, except for the dummy start task.
The priority list then is (10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11). Thus, scheduling
independent tasks using the heuristic algorithm is equivalent to the LPT rule. The
completion time of the heuristic algorithm is 15, which corresponds to the worst case
mapping performance of the LPT rule. In the four-processor case, the worst relative
approximation error is $(CT - CT^*)/CT^* = (15 - 12)/12 = 0.25$. If the pair-wise
exchange algorithm is used in this example, the completion time is improved to 13, and
if the optimal mapping (A*) algorithm is used, the completion time is 12. The
performance of the three algorithms is illustrated via Gantt charts in Fig. 4-2.

FIGURE 4-1: MODIFIED INDEPENDENT TASK GRAPH

The second example is related to the mapping of a tree-structured task graph with
nonidentical service demands, as shown in Fig. 4-3. A dummy start task, task 8, is
added to the task graph. Links between task 8 and tasks 1, 2, 3, 4 are added to the
task graph as in the previous example. The levels of the tasks are (4, 5, 3, 5, 2, 6,
1, 6) and the priority list constructed by the heuristic algorithm is (8, 6, 2, 4, 1,
3, 5, 7). The mapping provided by the heuristic algorithm is $T_1$ = {8, 6, 1, 3, 5, 7}
and $T_2$ = {2, 4}, with a completion time of 9 time units. If the pair-wise exchange
algorithm is applied to this example, the completion time is 8, which is optimal in
this case. The optimal mapping is $T_1$ = {8, 4, 1, 3, 5, 7} and $T_2$ = {2, 6}.
Application of the heuristic algorithm to tree-structured tasks with arbitrary service
demand for each task corresponds to critical path scheduling.

HEURISTIC ALGORITHM: COMPLETION TIME = 15
PAIR-WISE EXCHANGE ALGORITHM: COMPLETION TIME = 13
A* ALGORITHM: COMPLETION TIME = 12

FIGURE 4-2: GANTT CHART FOR THE EXAMPLE OF FIGURE 4-1


The  third  task  graph  is  taken  from  Kasahara  and  Narita  [1984]  and  is  shown  in 
Fig.  4-4.  It  is  assumed  that  the  amount  of  data  transmitted  among  tasks  is  negligible. 
That  is,  precedence  constraints  among  tasks  are  considered,  but  the  communication 
among  tasks  is  neglected.  The  level  of  the  start  task  (in  this  case  9  time  units)  forms  a 
lower  bound  for  the  completion  time,  i.e.,  no  matter  how  many  processors  (with  the 
service rate limited by the fastest service rate in the processor graph) are used to solve
this  problem,  the  completion  time  must  be  at  least  equal  to  the  level  of  the  start  task. 
In  this  example,  the  completion  time  of  the  heuristic  algorithm  is  9  which  is  equal  to 
the  level  of  the  start  task.  This  means  that  the  heuristic  algorithm  provides  the  optimal 
mapping  in  this  case.  If  a  completion  time  smaller  than  the  level  of  the  start  task  is 
desired,  it  can  be  accomplished  in  one  of  the  two  following  ways:  further  partitioning 
of  the  tasks  lying  on  the  critical  path  into  several  parallel  tasks  such  that  the  level  of 
the  start  task  is  reduced  or  replacing  the  fastest  processor  in  the  architecture  with  an 
even  faster  service  rate  processor.  However,  decomposing  a  task  into  several  subtasks 
may  result  in  additional  communication  among  subtasks. 


FIGURE 4-3: MODIFIED TREE-STRUCTURED TASK GRAPH

FIGURE 4-4: TASK GRAPH EXAMPLE FROM KASAHARA AND NARITA, 1984


The fourth example corresponds to a dynamic scene analysis algorithm and is taken from
Agrawal [1986]. The task graph, shown in Fig. 4-5, consists of data dependencies and
precedence relationships among tasks, a start task and a terminal task, and hence our
model can be applied to it directly. We used three algorithms to solve this problem.
The completion time of the mapping with the heuristic algorithm is 23.60, the pair-wise
exchange algorithm provides a mapping with a completion time of 23.33, and the A*ε
algorithm provides a mapping with a completion time of 23.06 when ε=0.01. The A*
algorithm, when applied to this example, did not terminate even after 4000 node
expansions. Although we were unable to obtain the optimal solution for this problem, we
know that the solutions of the heuristic and pair-wise exchange algorithms are near the
optimal solution, since the completion time of the mapping provided by the A*ε
algorithm, 23.06, with ε=0.01 is within 1% of the optimal completion time.


FIGURE 4-5: TASK GRAPH FROM AGRAWAL, 1986

The fifth example is taken from Chu and Lan [1987]. The control-flow graph, consisting
of AND and OR nodes, is shown in Fig. 4-6. Every branch of an AND node should be
executed. However, only one branch of an OR node is executed. The probability of
executing each branch of the OR node is shown in Fig. 4-6. The control-flow graph can
be converted into the task graph by computing the service demand of a task as the
product of the number of visits and the service demand of the corresponding node in the
control-flow graph. For example, the number of visits to node 2 in the control-flow
graph is $1 + 0.2 + 0.2^2 + 0.2^3 + \cdots = 1.25$. The service demand of task 2 in the
task graph is 1.25 x 1000 = 1250. Similarly, the service demand of task 3 in the task
graph is equal to the number of visits, 1.25 x 0.5 = 0.625, times the service demand
per visit, 100, which is 62.5. After 100 arrivals, the task graph is as shown in
Fig. 4-7. The link parameters are assumed to be 1400 or 1500. The completion time of
the mapping provided by the heuristic algorithm is 316250 which, again, is equal to
the level of the start task (or the minimum completion time) and is therefore optimal.

FIGURE 4-6: CONTROL FLOW GRAPH EXAMPLE FROM CHU AND LAN, 1987

FIGURE 4-7: MODIFIED TASK GRAPH FOR FIGURE 4-6
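
The visit-count arithmetic of the fifth example can be reproduced in a few lines. This is only a check of the numbers quoted in the text; the function name is hypothetical, and the loop-back probability of 0.2 and branch probability of 0.5 are read off the worked figures rather than from Fig. 4-6.

```python
def expected_visits(loop_probability):
    """Sum of the geometric series 1 + p + p^2 + ..., valid for 0 <= p < 1."""
    if not 0.0 <= loop_probability < 1.0:
        raise ValueError("loop probability must lie in [0, 1)")
    return 1.0 / (1.0 - loop_probability)


visits_node2 = expected_visits(0.2)      # about 1.25 visits to node 2
demand_task2 = visits_node2 * 1000       # about 1250
visits_node3 = visits_node2 * 0.5        # about 0.625 visits to node 3
demand_task3 = visits_node3 * 100        # about 62.5
```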

The sixth example, shown in Fig. 4-8 (a), is similar to the task graph considered by
Stone [1977]. We modify the task graph into a directed graph such that each directed
edge goes from a lower numbered task to a higher numbered task, as shown in Fig. 4-8
(b); note that task 7 is a dummy terminal task. The service demands and the amount of
data transmitted among tasks are the same as those in Stone's example. Note that the
computer system used by Stone is a heterogeneous system wherein the processing time of
tasks does not correspond to a uniform system. We assume that the processors have
identical service rates and that the service demand of a task is different on different
processors. The level of a task is determined by taking the lowest service demand for
each task. For example, the service demand of task 1 is (5,10), which means the service
demand is 5 when task 1 is assigned to processor 1 and 10 when task 1 is assigned to
processor 2. We take the service demand of task 1 as 5 when computing the levels of
tasks in the task graph. Similarly, we take the service demand of task 2 as 2, 4 for
task 3, 3 for task 4, and so on. The levels of tasks 1 to 7 are (13, 8, 6, 5, 2, 4, 0)
and the priority list of the heuristic algorithm is (1, 2, 3, 4, 6, 5, 7). The mapping
provided by the heuristic algorithm is $T_1$ = {1, 2, 3, 4, 5, 7} and $T_2$ = {6}. The
completion time of the heuristic algorithm is 22. The computation time plus
communication time is 26+12=38, which is the same as Stone's result.

FIGURE 4-8: TASK GRAPH TAKEN FROM STONE, 1977 AND ITS MODIFIED TASK GRAPH

The final task graph, shown in Fig. 4-9, has not been addressed in the literature.
Note that the processors have different service rates and the links have different
capacities. The level of the start task is the sum of the service demands on the
critical path, 210, divided by the fastest service rate, 2, which is 105 in this case.
The levels of the other tasks can be derived from equation 3-7. The priority list of
tasks is constructed according to nonincreasing levels first, and nonincreasing numbers
of successors if tasks have the same level. The execution order then is (1, 2, 5, 4, 3,
6, 7, 8) and the corresponding levels are (105, 100, 100, 50, 50, 45, 45, 0). The
highest level task, task 1, is assigned to one of the fastest processors, processor 2,
with a completion time of 5 time units. The second task to be assigned is task 2 in
this case. We assume that data communication takes place along the shortest delay
paths. For example, suppose we send one unit of data from processor 2 to processor 4.
If the routing path is (2,3,4), the communication time is
$1/c_{23} + 1/c_{34} = 1/2 + 1/2 = 1$. If instead the routing path is (2,1,4), the
communication time will be $1/c_{21} + 1/c_{14} = 1/1 + 1/4 = 1.25$. Hence, the shortest
routing path is (2,3,4). If task 2 is assigned to processor 1, the completion time will
be the completion time of task 1 plus the communication delay from processor 2 to
processor 1 plus the service time of task 2 at processor 1, and is equal to
$5 + v_{12}/c_{21} + s_2/\mu_1 = 125$. Similarly, if task 2 is assigned to processor 2,
then the completion time of task 2 will be the completion time of task 1 plus the
service time of task 2 at processor 2. That is, the completion time of task 2 at
processor 2 is $5 + s_2/\mu_2 = 60$. We can compute the completion time of task 2 at
processors 3 and 4 from equations 3-1 and 3-2 in a similar manner. The best choice is
processor 2. The next task is task 5. If task 5 is assigned to processor 2, the
completion time of task 5 will be the completion time of the last task assigned to
processor 2 plus the service time of task 5 at processor 2, which is 115. In other
words, if we can find a processor such that the completion time of task 5 on that
processor is less than 115, we will assign task 5 to that processor. We find that
assigning task 5 to processor 4 gives the earliest completion time of task 5, so
processor 4 is the best choice. The results of the mapping algorithms are shown in
table 4-1. The completion time of the mapping provided by the heuristic algorithm is
153.5. The pair-wise exchange algorithm provides a mapping with a completion time of
115, which is an improvement of 25% for this example. The A* (optimal) algorithm
provides a mapping with a completion time of 110.5. Intuitively, the optimal mapping
should allocate tasks 2 and 6 to processors 2 or 4, task 3 to processors 1 or 3, task 4
to processor 3, and tasks 5 and 7 to processor 2. This assignment will balance the
processing time at different processors after task 1 has been completed. However, the
heuristic and pair-wise exchange algorithms will not compute this mapping.

FIGURE 4-9: TASK AND PROCESSOR GRAPH FOR A HYPOTHETICAL EXAMPLE
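
The routing comparison in the final example (path (2,3,4) versus path (2,1,4)) amounts to a shortest-path search in which each link contributes a delay of data volume divided by link capacity. The sketch below uses Dijkstra's algorithm for that search. The function name is hypothetical, and the four link capacities are the ones implied by the worked fractions in the text (2 for links 2-3 and 3-4, 1 for link 2-1, 4 for link 1-4), not values read from Fig. 4-9.

```python
import heapq


def shortest_delay_path(capacity, src, dst, data=1.0):
    """Return (delay, path) of the minimum-delay route from src to dst.

    capacity -- dict mapping an undirected link (u, v) to its capacity
    data     -- amount of data sent; each link adds data / capacity
    """
    adj = {}
    for (u, v), c in capacity.items():
        adj.setdefault(u, []).append((v, data / c))
        adj.setdefault(v, []).append((u, data / c))
    best = {src: 0.0}
    heap = [(0.0, src, (src,))]
    while heap:
        d, u, path = heapq.heappop(heap)
        if u == dst:
            return d, path
        if d > best.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v, path + (v,)))
    return float("inf"), ()


# Capacities implied by the worked example (assumed, see lead-in).
links = {(1, 2): 1, (2, 3): 2, (3, 4): 2, (1, 4): 4}
delay, route = shortest_delay_path(links, 2, 4)
```

Here `route` is `(2, 3, 4)` with a delay of 1, while the alternative through processor 1 would cost 1.25, in agreement with the text.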

TABLE 4-1: COMPUTATIONAL RESULTS FOR DIFFERENT MAPPING ALGORITHMS


4.2  EXPERIMENT  2:  RANDOM  GRAPHS 

In  order  to  understand  the  performance  of  the  heuristic  and  pair-wise  exchange 
algorithms,  we  performed  an  experiment  on  random  task  graphs.  In  this  experiment, 
the computer architecture is a 2-cube system with unit service rate at each processor
and  unit  link  capacity.  With  this  architecture,  the  service  demand  of  a  task  and  the 
amount  of  data  transmitted  among  tasks  can  be  treated  as  the  computation  time  of  that 
task  and  the  communication  time  between  tasks  respectively.  The  task  structure  was 
generated  as  follows: 

Algorithm 4.1: Graph Generator Algorithm. Given the ratio p of computation time /
communication time, the graph generator algorithm generates random, directed, acyclic
task graphs with number of tasks in the range [1,12].

Step 1: Generate a random integer number N in [1,12]
        Pick a start task j in [1,N]
        CLOSE={j}

Step 2: For i=1 to N do
            If $i \notin$ CLOSE then
                Flip a coin to determine whether $i \in \sigma_j$
            end if
        end do
        If $|\sigma_j| = 0$ then
            go to step 2
        end if

Step 3: OPEN := $\sigma_j$
        Pick a task k in OPEN
        CLOSE = CLOSE $\cup$ {k}
        For i=1 to N do
            If $i \notin$ CLOSE then
                Flip a coin to determine whether $i \in \sigma_k$
            end if
        end do
        If $|\sigma_k| = 0$ then
            $\sigma_k$ = {N+1}
        end if

Step 4: The node parameters are randomly selected in the range [0,p]
        The link parameters are randomly selected in the range [0,1]
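
One possible reading of the generator is sketched below. Several details are assumptions rather than statements of the report: step 3 is repeated until the open list is empty, coin flips are fair, only tasks not yet expanded may become successors (which keeps the graph acyclic), the terminal task N+1 gets zero service demand, and the start task is re-flipped only when at least one candidate successor exists. The function name and return format are hypothetical.

```python
import random


def generate_task_graph(p, rng, max_tasks=12):
    """Return (succ, demand, data) for one random directed acyclic task graph.

    succ   -- dict: task -> list of successor tasks, in expansion order
    demand -- dict: task -> service demand, drawn from [0, p]
    data   -- dict: (task, successor) -> amount of data, drawn from [0, 1]
    """
    n = rng.randint(1, max_tasks)
    start = rng.randint(1, n)
    terminal = n + 1
    closed = set()
    succ = {}
    open_list = [start]
    while open_list:
        k = open_list.pop(rng.randrange(len(open_list)))
        closed.add(k)
        candidates = [i for i in range(1, n + 1) if i not in closed]
        children = [i for i in candidates if rng.random() < 0.5]
        # The start task must have a successor whenever one is possible.
        while k == start and candidates and not children:
            children = [i for i in candidates if rng.random() < 0.5]
        if not children:
            children = [terminal]          # tasks without successors feed the terminal task
        succ[k] = children
        for i in children:
            if i != terminal and i not in open_list:
                open_list.append(i)
    demand = {k: rng.uniform(0.0, p) for k in succ}
    demand[terminal] = 0.0                 # assumed: dummy terminal task
    data = {(k, i): rng.uniform(0.0, 1.0) for k in succ for i in succ[k]}
    return succ, demand, data
```

Because every successor is chosen among tasks that have not been expanded yet, the expansion order is a topological order of the generated graph. A call such as `generate_task_graph(10.0, random.Random(0))` gives a reproducible instance.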

We define the ratio p as computation time/communication time, and vary p in the range
0.001 to 1000. For each value of p, we generate one hundred random task graphs and
solve the mapping problem with the heuristic, pair-wise exchange, and A* algorithms. A
notable result of the experiment is that the performance of the heuristic, pair-wise
exchange, and A* optimal mapping algorithms depends on the value of p, i.e., on the
ratio of computation time to communication time. For large values of p (≥10), the
heuristic algorithm tends to assign tasks to the first available processor. That is,
the mapping tends to balance the work load on each processor. In this case, the
percentage of test cases for which the heuristic algorithm provides an optimal solution
is greater than 95%. As p decreases, the percentage of test cases for which an optimal
mapping is provided by the heuristic algorithm decreases to 75% at p=1 and then
increases again. As p falls below 0.01, the heuristic algorithm provides an optimal
mapping for all random test cases. That is, the worst case performance of the heuristic
algorithm occurs when p=1 (i.e., when the communication time is approximately equal to
the computation time). The pair-wise exchange algorithm provides an optimal mapping in
over 95% of the test cases except when p is near 1, where it provides an optimal
mapping in 85% of the test cases. The deviation of the completion time of the heuristic
algorithm from that of the optimal mapping may be very large at a small value of p
(<0.01) because the communication time between one pair of tasks may happen to be very
small while being very large for other pairs of tasks, as compared to the computation
time. Since the heuristic algorithm always assigns a task to the processor that gives
the earliest completion time of the assigned task, a locally optimal mapping may result
in large amounts of communication time when large amounts of data need to be
transferred between two tasks in certain test cases. As for the pair-wise exchange
algorithm, the deviation of its completion time from the optimal completion time is
very small (below 2%) for almost all the test cases. The percentage of optimal mappings
and the average error for different values of p are plotted in Fig. 4-10. As for the
optimal (A*) algorithm, the number of generated nodes and the number of backtracks are
small for small values of p (<0.1) or large values of p (≥5). The A* algorithm
generates the maximum number of nodes and backtracks when p=1.


FIGURE 4-10: PERCENTAGE OF OPTIMAL MAPPING
AND AVERAGE RELATIVE ERROR

4.3 EXPERIMENT 3: APPLICATION TO WEAPON TARGET ASSIGNMENT
PROBLEM

This experiment involves the weapon target assignment and target sequencing
(WTA/TS) problem arising in boost phase battle management. As shown in Fig. 4-11,
the weapon target assignment can be decomposed into a four level optimization
problem by making use of the special features of the problem structure. The four
levels are: (1) grouping targets into clusters based on their proximity to one another
and the distinct launch tubes to which each target belongs, (2) determining the optimal
allocation of directed energy weapons (DEWs) to the target clusters defined by level 1,
(3) prescribing the optimal assignment of DEWs to individual targets within a cluster,
and (4) determining the optimal fire control sequence of each target for each DEW.

The first level (target cluster definition) problem is formulated as a cluster median
problem, which partitions the targets (both current and future expected) into groups
such that the total sum of distances (or equivalently, slew times) in a cluster to the
cluster median is minimized. The cluster median problem is a 0-1 integer programming
problem, which is solved using Lagrangian relaxation techniques. The second and third
levels (weapon-cluster allocation problem and weapon-target assignment within a
cluster) are formulated as mixed-integer linear programming problems with the
objectives of minimizing the leakage, balancing the allocation load among the weapons,
and minimizing allocations that require large slew (switch-over) times. These problems,
again, are solved via Lagrangian relaxation techniques [Korn et al., 1986]. Finally,
the fourth level (target sequencing problem) is formulated as one of minimizing the
weighted tardiness (i.e., value of leaked boosters) with sequence dependent setup
(i.e., slew) times. The problem is solved via an approximate polynomially complete
algorithm due to Ullman [1975] and Sahni [1976].

There are at least three salient features of the multi-level WTA/TS algorithm that
are worth noting. First, the algorithm contains both serial and parallel subalgorithms.
For example, when viewed as a whole, the cluster definition and the weapon-target
allocation within a cluster and the target sequencing problems (i.e., levels 3 and 4) are
parallelizable. In addition, portions of the Lagrangian relaxation approach used to solve
the level 1 and level 2 problems are parallelizable. The key question then is: what
constitutes a task of the WTA/TS algorithm, i.e., is it a subalgorithm or part of the
subalgorithm? It is our contention that tasks should be defined as (appropriate) portions
of subalgorithms to extract maximum parallelism. Second, the lower level subalgorithms
have to periodically communicate with the higher level algorithms. Third, there exists
time scale separation among the various levels, in the sense that higher level
optimization subalgorithms are executed less frequently than the lower level optimization
subalgorithms. For example, the target cluster definition and weapon-cluster allocation
subalgorithms (i.e., levels 1 and 2) are executed every 1-2 minutes or whenever a new
launch occurs. The level 3 (i.e., the weapon-target allocation within a cluster)
subalgorithm is executed every 10-30 seconds to one minute, while the level 4 target
sequencing subalgorithm is executed every 10-30 seconds. Consequently, lower level
subalgorithms impose a more frequent workload on the fault-tolerant computer
architecture than higher level ones.

Fig. 4-12 shows a task graph for the WTA/TS algorithm wherein each node
represents a subalgorithm. However, each subalgorithm can be a task graph in itself, if
parallelism is exploited at a lower level of granularity. Once the level of granularity is
selected, each task (or node) of the task graph is characterized by the number of
instructions of each type of operation to be executed, and each link denotes the amount
of data (in bits) to be transferred between the tasks. We use the following notation in
the sequel.


term     meaning

NT       number of targets
NW       number of weapons
NG       number of clusters
NT_k     number of targets in each cluster
NW_k     number of weapons assigned to cluster k
NT_kw    number of targets assigned to weapon w in cluster k,
         k = 1, 2, ..., NG; w = 1, 2, ..., NW_k
ND       number of dual iterations

TABLE 4-2: DEFINITION OF WTA VARIABLES

The service demands of tasks in the task graph can be approximated as follows:

task 1     : ND*NT + 4*NT  (clustering algorithm)
task 2     : ND*NW*(2*NG + 3) + (NG*NW)^2  (weapon target clustering allocation algorithm)
task 3.k   : ND*NT_k*NW_k + (NT_k*NW_k)^2,  k = 1, 2, ..., NG  (weapon-target assignment
             within a cluster)
task 4.k.w : (NT_kw)^order,  k = 1, 2, ..., NG; w = 1, 2, ..., NW_k; order is the
             computational complexity of the fire control sequencing algorithm

The  memory  requirement  of  each  task  in  the  task  graph  is  as  follows: 

task  1  :  VNT2  +  2*NT 

task 2     : 5*NT*NW

task 3.k   : 4*NT_k*NW_k,  k = 1, 2, ..., NG

task 4.k.w : 2^(NT_kw) for optimal solution
             4*NT_kw for weighted shortest processing time rule

Approximate link parameters, i.e., the volumes of data transferred between tasks i and j,
v_ij (measured in number of words), are:

edge <1,2>        : NG + 4*NT + 4*NT*NW
edge <2,3.k>      : 4*NT_k*NW_k + 2*NT_k + NW_k;  k = 1, 2, ..., NG
edge <3.k,4.k.w>  : 4*NT_k;  k = 1, 2, ..., NG; w = 1, 2, ..., NW_k

We assume that the weapon target assignment algorithm parameters are as follows:

(1) number of weapons in each cluster is uniformly distributed in [1,5]

(2) number of targets for each weapon is uniformly distributed in [1,50]

(3) number of dual iterations, ND = 100

The number of clusters and the order of fire control computational complexity are
variable parameters. The processor graph is a 3-cube computer system with service rate
and link capacity taken from the real N-cube computer, i.e., the service rate at each
processor is 0.5 Mflops and the link capacity of each link is 10 Mbits/sec. In order to
understand the dominant factors of the weapon target assignment, we made 35 runs with
the number of clusters in the range [1,7] and the complexity of the fire control
sequencing algorithm taken as a function of the number of targets, NT_kw^α, with α in
[1,5]. The test results are shown in Table 4-3. It is clear from the results that the
completion time increases considerably as α changes from 4 to 5, which means that the
maximum admissible order of the fire control computational complexity should be less
than or equal to 4. Otherwise it will take several minutes to process a subalgorithm
and will not meet the real time requirements of the boost-phase WTA/TS algorithm.
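
The effect of the complexity order can be illustrated directly from the service demand
expressions. The following is a minimal sketch, not the code used for the runs: the
function names are hypothetical, and it assumes that one unit of service demand costs one
floating point operation at the stated 0.5 Mflops service rate, with communication time
ignored.

```python
SERVICE_RATE = 0.5e6  # operations per second (0.5 Mflops per processor)


def task1_ops(nd, nt):
    """Clustering algorithm: ND*NT + 4*NT."""
    return nd * nt + 4 * nt


def task2_ops(nd, nw, ng):
    """Weapon-cluster allocation: ND*NW*(2*NG + 3) + (NG*NW)^2."""
    return nd * nw * (2 * ng + 3) + (ng * nw) ** 2


def task3_ops(nd, nt_k, nw_k):
    """Weapon-target assignment within cluster k: ND*NT_k*NW_k + (NT_k*NW_k)^2."""
    return nd * nt_k * nw_k + (nt_k * nw_k) ** 2


def task4_ops(nt_kw, order):
    """Fire control sequencing for weapon w in cluster k: (NT_kw)^order."""
    return nt_kw ** order


def seconds(ops):
    """Execution time on one processor, ignoring communication."""
    return ops / SERVICE_RATE


if __name__ == "__main__":
    # A weapon with 50 targets (the upper end of the assumed range).
    for order in range(1, 6):
        print(order, seconds(task4_ops(50, order)))
```

Under these assumptions a single sequencing task with 50 targets needs 12.5 seconds at
order 4 and 625 seconds at order 5, which is in line with the jump in completion time
reported between those two orders.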

4.4  EXPERIMENT  4:  APPLICATION  TO  MULTITARGET  TRACKING 

Ballistic Missile Defense and Airborne Surveillance require identification and
tracking of several hundred targets in real time. Multitarget tracking algorithms
designed for these problems demand large computational resources which generally
cannot be fulfilled with conventional von Neumann types of processors. However, due to
the simultaneous tracking of several targets, the multitarget tracking algorithm will
contain several computational tasks which may be run in parallel. With the
development of multiprocessor architectures, adapting multitarget tracking algorithms
to these new computer architectures poses several challenging problems including a
good selection of computer architecture, task partitioning, and task allocations.


Percentage of optimal solutions obtained with the heuristic algorithm in the
thirty-five runs of randomly generated weapon target assignment graphs = 100%
0.5 MFLOPS; 10 MBITS/SEC; 3-CUBE

ORDER                          NUMBER OF CLUSTERS
            1         2         3         4         5         6         7

  1       1.0000    1.0655    1.1477    1.1559    1.1830    1.4597    1.6944
  2       1.0000    1.0917    1.1563    1.1669    1.1951    1.4774    1.7154
  3       1.0000    1.6817    1.3221    1.4019    1.4462    1.7857    2.2555
  4       1.0000    1.5700    1.7609    2.1274    2.2028    2.7994    3.0724
  5       1.0000    1.3573    1.5676    1.9042    1.9440    2.4894    2.4726

SPEEDUP VERSUS # OF CLUSTERS
AND THE ORDER OF COMPLEXITY


COMPLETION TIME (SECOND)

ORDER                               NUMBER OF CLUSTERS
              1           2           3           4           5           6           7
            T=31        T=96        T=208       T=241       T=278       T=355       T=456
            W=1         W=4         W=9         W=10        W=12        W=16        W=19
            TASK=5      TASK=9      TASK=15     TASK=17     TASK=20     TASK=25     TASK=29

  1         0.0167      0.1351      0.8003      0.8147      0.8356      0.8797      0.9342
  2         0.0188      0.1361      0.8035      0.8179      0.8388      0.8828      0.9374
  3         0.0885      0.1626      0.9380      0.9524      0.9733      1.0173      1.0719
  4         2.33885     2.4063      6.4516      6.4660      6.4869      6.5310     10.1984
  5        78.2874     78.3053    232.5121    232.5265    232.5474    232.5915    459.1290

TABLE 4-3: COMPLETION TIME VERSUS NUMBER OF CLUSTERS
AND COMPLEXITY OF FIRE CONTROL ALGORITHM

Implementation of the multitarget tracking algorithm requires the use of Kalman
filters for updating target tracks and maintaining hypotheses corresponding to
combinations of likely target tracks. Multitracker [Kurien, 1986] uses the mathematical
framework of Hybrid State Estimation to formulate a solution methodology for the
multitarget tracking problem. The general hybrid state model consists of continuous and
discrete-valued states. Using measurements related to the hybrid state, it is possible to
compute an optimal (minimum mean squared or maximum a posteriori) estimate of the
hybrid state.¹

One of the current approaches to this problem is termed the track-oriented
approach, which provides a systematic methodology for constructing the optimal
solution for multitarget tracking. However, for all practical scenarios which consist of
several measurements in each scan, the computational requirements (both processing
time and memory) will deplete the resources of any currently available computer. The
reason for this problem is that the optimal algorithm postulates and retains all possible
global hypotheses including the ones that are only remotely probable. In order to
construct a practical algorithm, all such unlikely global hypotheses have to be eliminated.
The key techniques incorporated in multitracker are (1) N-Scan Approximation, (2)
Gating, (3) Classification, (4) Classification of Targets, and (5) Clustering. Details are
discussed by Kurien [1986].

Algorithm parameters are chosen on the basis of the anticipated scenario. For
example, the number of confirmed targets accommodated by the algorithm should
correspond to the maximum number of targets anticipated within the surveillance
region; the number of tracks permitted for each target should take into account the
clutter density, the proximity of other targets, and the probability of detection for
targets. Similarly, the number of targets and tracks per target permitted for
Intermediate, Tentative, and Born targets should be based on target birth and death
distributions and clutter distribution.

¹ We are grateful to Dr. Thomas Kurien for providing the data and multi-tracker algorithm
of Fig. 4-13.

The major computational effort in multitracker is confined to three functional
steps: (1) track predictions, (2) track updates, and (3) track pruning. The parameters
used in multitracker are defined in Table 4-4:


term     meaning

N_c      number of confirmed targets
N_i      number of intermediate targets
N_t      number of tentative targets
N_b      number of born targets
B_c      number of tracks per confirmed target
B_i      number of tracks per intermediate target
B_t      number of tracks per tentative target
N_r      number of returns per scan
NTC      number of targets per connected cluster
NGH      maximum number of global hypotheses
NS       number of subclusters
NTS      number of targets in each subcluster
NC       number of connected clusters
n        number of states modeled in the Kalman filter
m        number of measurements in each return

TABLE 4-4: DEFINITION OF MULTI-TRACKER VARIABLES

The computational requirements of each step are as follows:

prediction step   : 1.5(n^3 + n^2) multiplications, and
                    1.5(n^3 - n) additions
gating step       : N_r*m + m(n^2 + 2n) multiplications, and
                    2*m*N_r + m(n^2 + n - 1) additions
updating step     : 1.5(n^3 + n^2) multiplications, and
                    1.5(n^2 - n) additions
clustering step   : NTS*(NS*(NS - 1))/2 comparisons
global hypotheses : min[(B_c + 1)^NTC, NGH] * [...]/6 comparisons, and
                    min[(B_c + 1)^NTC, NGH] * (NTC - 1) multiplications

The operation count for pruning and promotion of targets, and for the creation of born
targets is negligible.


Communication requirements for multitracker are as follows:
prediction step  : 32n^2 + 48n + 8 bits
gating step  : 16(n^2 + N_r*m) + 32(n + m) + 8 bits
updating  step  :  16(/i2+A^f  *m)+32(n+m)  bits 
clustering step  : 16*NC*B_c*(NSCAN + 1) + 4(NC*B_c) bits
global hypotheses: 16*NTC*B_c*(NSCAN + 1) + 36*NTC*B_c bits


A directed task graph summarizing the steps executed in one cycle of multitracker is
shown in Fig. 4-13. We assume the following times for arithmetic operations:
time for 32-bit multiplication : 4 units
time for 32-bit addition       : 2.6 units
time for 16-bit comparison     : 1.3 units

Typical values for variables are:

n = 4
m = 3
N_c = 100
B_c = 6
tf=1.5 
N_i = 20
B_i = 3
N_t = 30
B_t = 3

N_b = N_r = 100
NTS = 2
NS = 50
NTC = 4
NC = 25
NGH = 100
NSCAN = 2
Nt  =1.5 

The associated computational requirements for various steps in Fig. 4-13 may then be
evaluated as follows:

predict confirmed tracks     : 600 * 714 time units
predict intermediate tracks  : 60 * 714 time units
predict tentative tracks     : 90 * 714 time units
predict born tracks          : 100 * 714 time units
gate confirmed tracks        : 600 * 3196.2 time units
gate intermediate tracks     : 60 * 3196.2 time units
gate tentative tracks        : 90 * 3196.2 time units
gate born tracks             : 100 * 3196.2 time units
update confirmed tracks      : 900 * 825 time units
update intermediate tracks   : 60 * 825 time units
update tentative tracks      : 60 * 825 time units
update born tracks           : 90 * 825 time units
clustering                   : 3183 time units
global hypotheses generation : 25 * 16380 time units

FIGURE 4-13: DIRECTED FLOW GRAPH OF OPERATIONS
IN ONE SCAN OF TRACKING ALGORITHM

FIGURE 4-14: TASK GRAPH FOR AN
ILLUSTRATIVE EXAMPLE


The  associated  data  to  be  transferred  are  shown  as  follows: 
prediction  step  :  712  bits/track 
gating  step  :  5288  bits/track 

updating  step  :  576  bits/track 
clustering  step  :  13200  bits/track 
global  hypotheses  step:  2016  bits/track 
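
Several of the per-track figures can be reproduced from the operation counts and the
typical parameter values. The sketch below is only a consistency check with hypothetical
function names; it assumes the prediction step costs 1.5(n^3 + n^2) multiplications and
1.5(n^3 - n) additions, which is the reading that reproduces the 714 time units quoted
for one prediction.

```python
import math

T_MUL, T_ADD = 4.0, 2.6  # time units per 32-bit multiplication and addition


def prediction_time(n):
    """Time units for one track prediction with an n-state Kalman filter."""
    mults = 1.5 * (n**3 + n**2)
    adds = 1.5 * (n**3 - n)
    return mults * T_MUL + adds * T_ADD


def gating_time(n, m, n_r):
    """Time units for gating one track against n_r returns of m measurements."""
    mults = n_r * m + m * (n**2 + 2 * n)
    adds = 2 * m * n_r + m * (n**2 + n - 1)
    return mults * T_MUL + adds * T_ADD


def prediction_bits(n):
    return 32 * n**2 + 48 * n + 8


def gating_bits(n, m, n_r):
    return 16 * (n**2 + n_r * m) + 32 * (n + m) + 8


def global_hypotheses_bits(ntc, b_c, nscan):
    return 16 * ntc * b_c * (nscan + 1) + 36 * ntc * b_c


if __name__ == "__main__":
    n, m, n_r = 4, 3, 100
    print(prediction_time(n), gating_time(n, m, n_r))
    print(prediction_bits(n), gating_bits(n, m, n_r), global_hypotheses_bits(4, 6, 2))
```

With n = 4, m = 3, N_r = 100, NTC = 4, B_c = 6, and NSCAN = 2, these expressions give
714 and 3196.2 time units and 712, 5288, and 2016 bits, matching the values listed above.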

With the parameters listed above, we can vary the granularity of a task and then,
using the algorithms described in section 3, solve the associated mapping problem. If
each node of Fig. 4-13 is defined as a task, then the task graph can be constructed as
in Fig. 4-14. If we use a 2-cube computer as the computational resource graph and use
the heuristic algorithm to solve the mapping problem, we find that the completion time
for one scan of the tracking algorithm is 3.5045 seconds and the corresponding speedup
is 1.3378. Note that the completion time of the heuristic algorithm is equal to the
level of the start task (i.e., the maximum speedup is achieved); no matter how many
processors are used to solve this problem, the maximum speedup is limited by the level
of the start task. Nevertheless, since the nodes of the prediction, gating, and updating
steps are parallelizable (i.e., can be decomposed into parallel subtasks), we can vary
the granularity of a task to decrease the start task level in order to get a better
speedup. In order to find the best task size for the multitracker, we partition the
parallelizable tasks into subtasks, thereby constructing a variable task graph. The
heuristic and pair-wise exchange algorithms are employed to solve the corresponding
mapping problem. At the beginning, we define 50 targets as a task in the confirmed path
with all the other conditions fixed, and find that the speedup increases by a factor of
two. When 25 targets are defined as a task, the speedup is 3.68. If we continue
partitioning the tasks into smaller sizes (i.e., decrease the number of targets per
task), we find that the speedup is limited to 3.85. It is obvious that the speedup of a
4-processor computer architecture is at most 4. Therefore, we conjecture that the
speedup may be limited by the architecture. For this reason, we change the architecture
to a 3-cube computer architecture and make the same runs. The definition of 10 targets
per task resulted in a speedup of 4.85. As the task size becomes smaller, the speedup
curve levels off around 5.4. The detailed results are tabulated in Table 4-5. Finally,
we upgrade the architecture to a 16-processor computer system and repeat the
partitioning experiment: the maximum speedup is 7.2.


# of confirmed              heuristic algorithm          pair-wise exchange algorithm
targets per task   level    completion time   speedup    completion time   speedup

      100          3.50          3.50          1.34           3.50          1.34
       50          1.96          2.05          2.29           2.05          2.29
       25          1.19          1.27          3.68           1.27          3.68
       20          1.03          1.10          4.25           1.10          4.25
       10          0.72          0.97          4.84           0.93          5.02
        5          0.57          0.85          5.51           0.84          5.60
        4          0.54          0.86          5.43           0.83          5.64
        2          0.48          0.87          5.37           0.82          5.70
        1          0.47          0.86          5.42            -             -

TABLE 4-5 (a): COMPLETION TIME AND SPEEDUP VERSUS NUMBER
OF TARGETS FOR 3-CUBE MULTI-PROCESSOR SYSTEM


# of confirmed              heuristic algorithm          pair-wise exchange algorithm
targets per task   level    completion time   speedup    completion time   speedup

      100          3.50          3.50          1.34          3.5045         1.34
       50          1.96          2.05          2.29          2.0466         2.29
       25          1.19          1.27          3.68          1.2744         3.68
       20          1.03          1.49          3.14          1.4171         3.31
       10          0.72          1.26          3.72          1.2249         3.83
        5          0.57          1.24          3.79          1.2212         3.84
        4          0.54          1.23          3.81          1.2155         3.86
        2          0.48          1.26          3.73          1.2120         3.87
        1          0.47          1.23          3.83            -             -

TABLE 4-5 (b): COMPLETION TIME AND SPEEDUP VERSUS NUMBER
OF TARGETS FOR 2-CUBE MULTI-PROCESSOR SYSTEM


The next experiment involves the selection of task size in every path to obtain a
maximum speedup, i.e., the task partitioning of the four different paths and the global
hypotheses step simultaneously. If a 16-processor architecture is used, we find the
maximum speedup for any possible decomposition of task size to be about 11.44, for
the case when 5 targets make up a task in every path and 5 connected clusters form a
task at the global hypotheses step. Since the utilization of the processors is 0.7, we
conclude that increasing the number of processors will not improve the speedup
considerably. In other words, a multiprocessor architecture with 16 processors will be
a good choice for solving the multitarget tracking problem.
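
The utilization figure follows from the usual relations between completion time, speedup,
and the number of processors. A minimal sketch, assuming utilization is defined as speedup
divided by the number of processors (the function names are illustrative):

```python
def speedup(serial_time, parallel_time):
    """Ratio of single-processor completion time to multiprocessor completion time."""
    return serial_time / parallel_time


def utilization(speedup_value, n_processors):
    """Average fraction of the processors kept busy."""
    return speedup_value / n_processors


if __name__ == "__main__":
    # Reported maximum speedup on the 16-processor system.
    print(utilization(11.44, 16))
    # Single-processor time implied by the 2-cube run (3.5045 s at speedup 1.3378).
    print(3.5045 * 1.3378)
```

Under this definition a speedup of 11.44 on 16 processors corresponds to a utilization of
0.715, consistent with the quoted value of 0.7.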

In  order  to  increase  the  speed  up,  tasks  on  the  critical  path  (in  this  case, 
confirmed  path)  should  be  decomposed  further.  In  this  example,  the  decomposition  of 
the  other  paths  will  not  have  a  significant  effect  on  the  speedup.  However,  if  we 
decompose  the  task  to  a  very  small  size  such  that  the  critical  path  is  no  longer  the 
confirmed  path,  or  the  highest  level  of  this  path  is  almost  the  same  as  that  of  any  other 
path, then the decomposition of other paths may become important.

4.5 SUMMARY

In this section, we performed four experiments: (1) hypothetical examples, (2) random
graphs, (3) application to the weapon target assignment problem, and (4) application
to multitarget tracking. The first experiment involved examples addressed in the
literature. Our mapping algorithms were applied to these examples, and the
corresponding results showed that the heuristic mapping algorithm is a good
methodology. In the second experiment, hundreds of random graphs with different ratios
of computation time/communication time were generated as test examples. The test
results showed that the heuristic algorithm provides optimal mapping when the ratio of
computation time/communication time is large (>10) or small (<0.1). The worst case
occurred when the computation time is approximately equal to the communication
time, wherein the heuristic algorithm provides optimal mapping in only 75% of the test
cases. The pair-wise exchange algorithm provided optimal mapping in over 95% of the
test cases for almost all values of computation time/communication time. In the third
experiment, the mapping of a tree-structured weapon target assignment problem was
considered. The service demands and the amount of data transmitted among tasks were
related to the number of clusters, number of weapons, number of targets, and fire
control complexity (α). The range of α in [1,5] together with the range of the number
of clusters in [1,7] were studied in this experiment. The results showed that unless
the fire control complexity parameter, α, is less than or equal to four, the algorithm
will not satisfy the real-time requirements of boost-phase WTA/TS. The percentage of
optimal mappings provided by the heuristic algorithm was 100% in these 35 runs. The
last experiment involved the multi-tracker algorithm for multitarget tracking. In this
experiment, the granularity of a task and the selection of a computer architecture for
the multi-target tracker problem were studied. The results showed that the critical
path (confirmed path) dominates most of the computation time and, hence, the
granularity of a task on the critical path and an appropriate computer architecture
should be carefully selected in order to achieve the maximum speedup for the
multi-tracker algorithm.

SECTION 5
CONCLUSIONS AND FUTURE RESEARCH WORK

5.1  CONCLUSIONS 

In this report, we formulated the mapping problem as one of characterizing a
BM/C3 algorithm and the multi-processor architecture as graphs and of assigning and
sequencing the nodes of the algorithm graph to the processor graph to minimize the
completion time of the algorithm, subject to reliability, storage, and security
constraints. The formulation explicitly considered precedence constraints, inter-task
communication, security, storage, and reliability constraints. In addition, the
processors are assumed to be uniform with different service rates and link capacities.
A directed, acyclic task graph is used to denote an application algorithm whereas an
undirected processor graph is used to denote the multi-processor computer architecture.

The mapping problem is NP-complete, i.e., the memory and computational
requirements of an optimal solution become intractable as the number of tasks and the
number of processors become large. Therefore, all practical algorithms incorporate
heuristics. In section 3, we derived four mapping algorithms, viz., the heuristic,
pair-wise exchange, optimal (A*), and A*_ε mapping algorithms. The key mapping equations
that describe the evolution of completion time as a function of mapping are also derived
in that section. These equations form the basis of all four mapping algorithms. The
heuristic algorithm consists of two stages. The first stage employs the critical path
method to determine the task execution order, while the second stage sequentially
allocates the tasks from the ordered list to minimize the completion time. The pair-wise
exchange algorithm improves the performance of the heuristic algorithm by iteratively
exchanging the order of elements in the priority list, without violating precedence
constraints. Since the heuristic and pair-wise exchange algorithms do not guarantee an
optimal solution, we developed an optimal (A*) algorithm. The A* algorithm searches
for the optimal solution in a decision tree. This tree is created by considering all
possible sequencing orders of execution and allocations. A path starting from the root
node and ending at the goal node corresponds to the optimal mapping and the task
sequencing at each processor. The completion time of the terminal task at the goal
node is also the optimal completion time of the algorithm. In order to improve the
search strategy of the A* algorithm, the optimality criterion is relaxed in the
development of the A*_ε algorithm. The A*_ε algorithm is ε-admissible, i.e., it always
finds a solution with a completion time that does not exceed the minimum completion
time by more than a factor of (1+ε).
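
The two-stage structure of the heuristic can be illustrated with a simplified list
scheduler. This is not the algorithm of section 3: it is a sketch under stated
assumptions (identical processors, strictly positive task costs, a fixed communication
delay per edge whenever two tasks sit on different processors, no link contention), and
all names in it are hypothetical.

```python
from collections import defaultdict


def task_levels(succ, cost):
    """Level of a task: its own cost plus the longest cost path to a terminal task."""
    levels = {}

    def level(task):
        if task not in levels:
            below = max((level(s) for s in succ.get(task, ())), default=0.0)
            levels[task] = cost[task] + below
        return levels[task]

    for task in cost:
        level(task)
    return levels


def list_schedule(succ, cost, comm, n_proc):
    """Stage 1: order tasks by decreasing level. Stage 2: place each task on the
    processor that gives it the earliest completion time."""
    pred = defaultdict(list)
    for task, successors in succ.items():
        for s in successors:
            pred[s].append(task)

    levels = task_levels(succ, cost)
    order = sorted(cost, key=lambda t: -levels[t])  # valid order for positive costs

    proc_free = [0.0] * n_proc
    finish, place = {}, {}
    for task in order:
        best_done, best_proc = None, None
        for p in range(n_proc):
            # Data from a predecessor on another processor arrives after a delay.
            ready = max(
                (finish[q] + (0.0 if place[q] == p else comm.get((q, task), 0.0))
                 for q in pred[task]),
                default=0.0,
            )
            done = max(ready, proc_free[p]) + cost[task]
            if best_done is None or done < best_done:
                best_done, best_proc = done, p
        finish[task], place[task] = best_done, best_proc
        proc_free[best_proc] = best_done
    return max(finish.values()), place


if __name__ == "__main__":
    cost = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 1.0}
    succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    comm = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "d"): 0.5, ("c", "d"): 0.5}
    print(list_schedule(succ, cost, comm, n_proc=2))
```

Because each task is placed greedily, such a scheduler can settle on a locally optimal
placement; this is the gap that the pair-wise exchange and A* algorithms are meant to
close.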

The  algorithms  are  extensively  tested  via  a  set  of  four  computational  experiments 
in  section  4.  Experiment  1  was  made  up  of  hypothetical  examples  taken  from  the 
literature.  Experiment  2  was  concerned  with  the  performance  assessment  of  the  heuris¬ 
tic  and  pair-wise  exchange  algorithms  on  random  graphs.  The  test  results  showed  that 
the  performance  depends  on  the  ratio  of  computation  time/communication  time.  The 
worst  case  performance  of  the  heuristic  and  pair-wise  exchange  algorithms  occurs 
when  the  computation  time  is  approximately  equal  to  the  communication  time.  It  was 
observed that, in this worst case, the heuristic algorithm provides an optimal mapping
in 75% of the test cases and the pair-wise exchange algorithm provides an optimal
mapping in over 85% of the test cases. When the ratio of computation time to
communication time is high or low (> 10 or ≤ 0.1), the heuristic and pair-wise exchange
algorithms provide an optimal mapping in almost 100% of the test cases. Experiment 3
applied the mapping algorithm to the WTA/TS problem arising in boost-phase battle
management. This problem can be decomposed into a 4-level optimization problem, and
the task graph constructed from this decomposition is tree-structured. We varied the
number of clusters in the range [1,7] and the computational order of the fire-control
sequencing algorithm in the range [1,5]. We found that the order of the fire-control
sequencing algorithm should be less than or equal to 4; otherwise, it will not meet the
real-time requirements of boost-phase BM/C³. In Experiment 4 we applied the mapping
strategy to a multi-target tracking problem. It was observed that the completion time
of the multi-target tracking algorithm is limited by the critical path of the task graph
in some test cases. In order to maximize speedup, the granularity of a task should be
selected carefully, because an inappropriate decomposition of a task may cause
additional communication problems. Furthermore, we also discussed the issues of
selecting a computer architecture for the multi-target tracking algorithm in this section.
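
The critical-path limit noted above can be illustrated with a short sketch. The task
graph, the node times and the function name below are hypothetical and are not taken
from the experiments. Communication delays are ignored, so the value computed is only a
lower bound on the completion time, whatever the number of processors.

```python
from collections import deque


def critical_path_length(compute_time, edges):
    """Longest path (sum of node times) through a task graph that must be a DAG.

    compute_time: dict task -> execution time (hypothetical values)
    edges: list of (predecessor, successor) precedence pairs
    """
    succ = {t: [] for t in compute_time}
    indeg = {t: 0 for t in compute_time}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Earliest finish time of each task when processors are unlimited.
    finish = dict(compute_time)
    ready = deque(t for t in compute_time if indeg[t] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + compute_time[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(compute_time):
        raise ValueError("task graph contains a cycle")
    return max(finish.values(), default=0.0)


# Hypothetical four-task graph: a precedes b and c, which both precede d.
times = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 4.0}
deps = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
bound = critical_path_length(times, deps)
max_speedup = sum(times.values()) / bound
print(bound, max_speedup)
```

For this example the critical path a, b, d has length 9.0 against a serial time of 10.0,
so no mapping can achieve a speedup above 10/9, regardless of the number of processors.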

5.2  FUTURE  RESEARCH  WORK 

Future work involves: (1) quasi-static task allocation, (2) dynamic task allocation,
and (3) task partitioning. The problem considered in this report was restricted to the
static task allocation problem, where the task graph $G_t(V_t, E_t)$ and the processor
graph $G_p(V_p, E_p)$ are time-invariant. The quasi-static task allocation problem is
concerned with the mapping of a sequence of task graphs
$G_t(V_t, E_t, t_0, t_1, \ldots, t_{i-1}, t_i, \ldots, t_n)$ onto a sequence of processor
graphs $G_p(V_p, E_p, t_0, t_1, \ldots, t_{i-1}, t_i, \ldots, t_n)$. The task graph may
change due to the completion of existing tasks or the arrival of new tasks. The processor
graph may change due to the failure and/or recovery of processors and/or communication
links. However, in a given time interval $(t_{i-1}, t_i)$, $i = 1, \ldots, n$, the task
graph $G_t(V_t, E_t)$ and the processor graph $G_p(V_p, E_p)$ are stationary. This
problem can be solved by solving a sequence of static mapping problems in the intervals
$(t_0, t_1), (t_1, t_2), \ldots, (t_{i-1}, t_i), \ldots, (t_{n-1}, t_n)$, while taking
into account migration delays.
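
A minimal sketch of this interval-by-interval approach is given below. It is not one of
the mapping algorithms of this report: the static mapper is a placeholder that ignores
communication, and the fixed per-task migration delay, the function names and the sample
data are all assumptions made for illustration.

```python
def static_map(tasks, processors, previous=None):
    """Placeholder static mapper (NOT one of the report's algorithms).

    Keeps a task on its previous processor when that processor is still
    available; otherwise assigns it to the least-loaded processor.
    tasks: dict task -> computation time
    processors: iterable of available processor names
    """
    previous = previous or {}
    load = {p: 0.0 for p in processors}
    mapping = {}
    for task, cost in sorted(tasks.items()):
        p = previous.get(task)
        if p not in load:  # new task, or its old processor has failed
            p = min(load, key=lambda q: (load[q], q))
        mapping[task] = p
        load[p] += cost
    return mapping


def quasi_static_map(intervals, migration_delay):
    """Solve one static problem per interval and total the migration delay.

    intervals: list of (tasks, processors) snapshots, one per time interval
    migration_delay: assumed fixed delay for moving one task
    """
    mappings, total_delay, previous = [], 0.0, {}
    for tasks, processors in intervals:
        mapping = static_map(tasks, processors, previous)
        moved = [t for t in mapping if t in previous and previous[t] != mapping[t]]
        total_delay += migration_delay * len(moved)
        mappings.append(mapping)
        previous = mapping
    return mappings, total_delay


# Hypothetical scenario: t1 completes, t4 arrives, p2 fails and p3 recovers.
intervals = [
    ({"t1": 4.0, "t2": 2.0, "t3": 3.0}, ["p1", "p2"]),
    ({"t2": 2.0, "t3": 3.0, "t4": 1.0}, ["p1", "p3"]),
]
mappings, delay = quasi_static_map(intervals, migration_delay=0.5)
print(mappings, delay)
```

In the second interval the two tasks that were on the failed processor are migrated, so
the accumulated migration delay in this example is twice the assumed per-task delay.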

The dynamic task allocation problem involves the mapping of a task graph
$G_t(V_t, E_t, t)$ onto a processor graph $G_p(V_p, E_p, t)$, where the task and
processor graphs are time-varying. The time-varying feature of the processor graph is
associated with automatic fault-detection, isolation, and reconfiguration strategies. To
solve this problem, we must consider the effects of current mapping decisions at time
$t$ on all future assignments and processor configurations. This multi-stage optimization
problem can be approached via approximate dynamic programming.

Task partitioning is related to the granularity of tasks and domain decomposition.
Partitioning a task into several parallel subtasks may increase or decrease the speedup
of an algorithm, depending on the balance between the parallelism gained and the
communication required among those subtasks. The best task partitioning should provide a
task graph that maximizes the speedup on a given computer architecture.
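
The trade-off between parallelism and communication can be made concrete with a toy
model. The model below is an assumption introduced only for illustration and is not
derived in this report: a task of total computation time T is split into k equal
subtasks on k processors, and each additional subtask adds a fixed communication
overhead c, so that the parallel time is T/k + c(k - 1).

```python
def speedup(total_time, k, comm_overhead):
    """Speedup of splitting one task into k equal parallel subtasks.

    Assumed toy model: parallel time = total_time / k + comm_overhead * (k - 1).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    parallel_time = total_time / k + comm_overhead * (k - 1)
    return total_time / parallel_time


def best_partition(total_time, comm_overhead, max_k):
    """Number of subtasks in 1..max_k that maximizes the modelled speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: speedup(total_time, k, comm_overhead))


# Hypothetical values: T = 100 time units, c = 2 time units per extra subtask.
for k in (1, 2, 4, 8, 16):
    print(k, round(speedup(100.0, k, 2.0), 2))
print(best_partition(100.0, 2.0, 16))
```

With these hypothetical values the modelled speedup rises up to k = 7 and falls
thereafter, which is the behaviour described above: finer partitioning eventually costs
more in communication than it gains in parallelism.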




APPENDIX  A 

A.1 MAPPER USER INTERFACE

MAPPER is an interactive graphical environment for analysing task allocation in
multiprocessor systems. It is a design tool that allows the user to evaluate alternative
partitioning and mapping strategies onto various computer architectures in a
systematic, intuitive and interactive manner.

MAPPER is a user-friendly package. The user sketches the task graph (computational
flow graph, CFG) and the processor graph (computational resource graph, CRG) using the
mouse. The user then configures the program by selecting the appropriate algorithms and
performs the mapping of the CFG onto the CRG. MAPPER displays the results in the form
of a Gantt chart along with other performance measures (speed-up, completion time and
processor utilization).
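
As an illustration of how such measures follow from a Gantt chart, the sketch below
computes them from a list of schedule entries. The data layout, the function name and
the sample schedule are hypothetical. The definitions used are also assumptions: the
serial time is taken as the sum of the task execution times (identical processors), the
speed-up as serial time divided by completion time, and the utilization as busy time
divided by completion time.

```python
def performance_measures(schedule):
    """Completion time, speed-up and utilization from Gantt-chart entries.

    schedule: list of (processor, task, start, end) tuples
    """
    if not schedule:
        raise ValueError("schedule is empty")
    completion_time = max(end for _, _, _, end in schedule)
    busy = {}
    for proc, _, start, end in schedule:
        busy[proc] = busy.get(proc, 0.0) + (end - start)
    serial_time = sum(busy.values())  # assumes identical processors
    return {
        "completion_time": completion_time,
        "speedup": serial_time / completion_time,
        "utilization": {p: b / completion_time for p, b in busy.items()},
    }


# Hypothetical two-processor schedule; P2 idles from t = 3 to t = 5.
schedule = [
    ("P1", "T1", 0, 4),
    ("P1", "T3", 4, 6),
    ("P2", "T2", 0, 3),
    ("P2", "T4", 5, 8),
]
measures = performance_measures(schedule)
print(measures)
```

For this sample schedule the completion time is 8, the speed-up is 1.5 and each
processor is busy 75% of the time.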

MAPPER is hosted on a SUN workstation and runs in the Suntools environment. Its user
interface follows the conventions of other suntools application programs (iconedit,
mailtool, etc.).

A.2 DESCRIPTION

MAPPER is a flexible design environment that allows the user to switch back and forth
between different functions. Specifically, the program has backtracking and editing
capabilities, thereby allowing the user to fine-tune a particular problem rapidly without
having to leave the environment.

MAPPER is like any suntool application. It is made up of six overlapping windows:

1. Main Command window.

2. Task graph window.

3. Processor graph window.

4. Algorithm selector window.

5. C-shell.

6. Results window.

FIGURE A-1: A SCREEN ENCOUNTERED IN A TYPICAL MAPPER WORK SESSION

A typical screen encountered in a MAPPER session is shown in Fig. A-1. The figure
shows the Task graph window, the top part of the Processor graph window, part of the
C-shell and the Results window in the foreground.

MAPPER works as follows. After invoking the program, the user clicks on the main
selection panel to open the task graph window. The task graph can then be drawn on the
screen using the mouse. The data-entry window pops up automatically whenever node or
link data are required. The task graph window has editing features (add nodes and links,
delete nodes, modify numerical data, destroy the whole graph). The processor graph
window is functionally identical to the task graph window, except that the user has a
choice of opening it in either a bus-architecture or a link-architecture mode.

The main selection panel has LOAD/SAVE features for both graph windows. These are
particularly useful for creating a library of processor graphs: standard architectures
can be loaded when required instead of being redrawn each time the program is used.

The algorithm selector window configures the mapping algorithms to operate in the
"with constraints" mode or the "without constraints" mode. After the selection is made,
the user can specify the algorithm to be used to perform the task allocation. MAPPER
computes the task allocation and presents the results in the C-shell window.

The visual display can be invoked by clicking on the main selection panel. The final
result is displayed as a Gantt chart for each processor, together with the relevant
performance measures.

The user can cycle through the stages described above interactively. This, coupled
with the editing features, allows the user to compare, analyse and evaluate mappings by
incrementally modifying CRGs and CFGs or by using different mapping algorithms.

A.3 SOFTWARE DESCRIPTION

MAPPER  is  written  as  a  two-level  program. 

1.  Mathematical  algorithms 

2.  Graphical  interface. 

The mathematical algorithms consist of the heuristic, pair-wise exchange, A-star and
approximate A-star mapping algorithms and are implemented in FORTRAN. These algorithms
can operate independently of the graphical interface, using keyboard and screen based
I/O or input from ASCII data files. This algorithm program forms the basis of MAPPER.

The graphical interface is a SunView-based graphical shell written in C. It enables
the user to access the mathematical algorithms in an interactive mode. Instead of
entering the data in numerical format, the user can draw the CRG and the CFG. The
interface converts these drawings into an ASCII data file, which is read by the FORTRAN
program. Similarly, the result file from the algorithm program is read by the graphics
program and converted into graphical results. The graphical program also turns the whole
package into an environment in which the user can work interactively.
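
The exchange of graphs through ASCII data files can be sketched as follows. The file
layout used here is purely hypothetical, since the actual MAPPER file format is not
described in this appendix; the function names and the sample graph are likewise
assumptions.

```python
def graph_to_ascii(node_times, links):
    """Serialize a graph to text.

    Hypothetical layout (the real MAPPER file format is not given here):
      line 1:      <number of nodes> <number of links>
      next lines:  <node index> <computation time>
      last lines:  <from node> <to node> <communication volume>
    """
    lines = [f"{len(node_times)} {len(links)}"]
    for index in sorted(node_times):
        lines.append(f"{index} {node_times[index]}")
    for src, dst, volume in links:
        lines.append(f"{src} {dst} {volume}")
    return "\n".join(lines) + "\n"


def ascii_to_graph(text):
    """Parse the hypothetical layout written by graph_to_ascii."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n_nodes, n_links = int(rows[0][0]), int(rows[0][1])
    node_rows = rows[1:1 + n_nodes]
    link_rows = rows[1 + n_nodes:1 + n_nodes + n_links]
    node_times = {int(i): float(t) for i, t in node_rows}
    links = [(int(s), int(d), float(v)) for s, d, v in link_rows]
    return node_times, links


# Hypothetical three-node task graph with two links.
nodes = {1: 2.0, 2: 3.5, 3: 1.0}
links = [(1, 2, 0.5), (1, 3, 0.25)]
text = graph_to_ascii(nodes, links)
print(text)
```

A plain whitespace-separated layout of this kind is easy to read from a FORTRAN program
with list-directed input, which is one reason such an interface might use it.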

APPENDIX B

USER’S  MANUAL: 

MAPPER is a mouse-driven, window-based suntool application. The reader of this
appendix is assumed to have a working knowledge of a suntool application such as
Dbxtool or Iconedit.

Section B.1 describes the compilation procedure for the source code. This is required
only if the source code is modified. Section B.2 describes the usage of the program,
including the definition and function of all the panel buttons.

B.1 COMPILATION

The source code can be compiled as follows:

cc -o mapper mapper.c -lsunwindow -lsuntool -lpixrect -lm

B.2  PROGRAM  DESCRIPTION 

Run  the  program  from  suntools  by  typing  mapper. 

Make  sure  that  the  algorithms  program  "alloc.out"  is  in  your  directory. 

Associated with each screen is a panel displaying choice buttons. The user can
activate a particular function by clicking on the corresponding choice button.

Choice button functions:

a) Main selection window:

1. TASK-GRAPH: Clicking on this button opens the task graph (CFG) window. The graph
can be drawn on this canvas window. The edit, delete and draw buttons activate the
respective modes of mouse operation. QUIT closes the window, and CLEAR destroys the
window and the data associated with it.

2. PROCESSOR-GRAPH LINKS: opens the processor graph window. This window is used to
draw message-passing architectures. Its functions are identical to those of the task
graph window.

3. PROCESSOR-GRAPH BUS: is a similar window used to draw bus architectures.

4.  RUN:  generates  the  files  required  for  the  algorithm  program  and  displays  the 
algorithm  selection  panel. 

5.  DISPLAY:  clicking  on  this  opens  the  display  window  and  displays  the 
mapping  results.  This  window  should  be  opened  after  the  mapping  program 
alloc.out  has  terminated. 

6. SAVE CRG / SAVE CFG: saves the CRG/CFG screens and the data. Make sure that you
enter the CRG filename and CFG filename before invoking this function.

7.  LOAD  CRG/  LOAD  CFG:  performs  the  corresponding  retrieval  of  the  CRG 
and  the  CFG  screens/data  from  the  named  files. 

8.  QUIT:  destroys  all  the  windows  and  exits  to  suntools.  All  data  is  lost  if  not 
saved  prior  to  QUIT. 

The main selection panel and the algorithm selection panel are shown in Fig. B-1.

B.3 DRAWING GRAPHS

The  procedure  to  draw  the  task  graph  and  the  processor  graph  is  similar. 

On entering a particular window (task or processor), make sure that the draw button
is highlighted, that is, that you are in the draw/add mode.

Clicking  the  MIDDLE  button  of  the  mouse  draws  a  node  (circle)  at  the  mouse- 
pointer.  The  data-entry  window  pops  up.  The  user  has  to  enter  data  and  click  on  OK 
before  proceeding. 

Once at least two nodes have been drawn, a link can be drawn by clicking the LEFT
button on each of the two nodes to be linked. Enter the data as usual and close the data
window by clicking OK.


FIGURE B-1: MAIN SELECTION WINDOW, ALGORITHM SELECTION PANEL AND C-SHELL

EDIT mode is meant for modifying the data associated with either a node or a link. On
entering this mode (highlight the edit button by clicking on it), click the MIDDLE
button at a node. The data-entry window pops up. Modify the required entries and close
the pop-up window. Clicking the LEFT button at two nodes brings up the data-entry window
associated with the link between them. Modify the entry and close the pop-up window.

The data-entry window for the task node is shown in Fig. B-2 and that for the
processor node in Fig. B-3.

DELETE mode is used for modifying the graph by deleting nodes. After entering this
mode (by clicking on delete), clicking the MIDDLE button at a node deletes that node and
the links associated with it. The remaining node indices are then renumbered.
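
The effect of a deletion on the stored graph can be sketched as follows. The data
layout and the function name are hypothetical, and the renumbering rule shown (every
index above the deleted node is shifted down by one) is an assumption; the manual states
only that the indices are renumbered.

```python
def delete_node(node_data, links, victim):
    """Delete node `victim` (1-based index), drop its links, renumber the rest.

    node_data: list whose position i holds the data of node i + 1
    links: list of (from_node, to_node, link_data) using 1-based indices
    """
    if not 1 <= victim <= len(node_data):
        raise IndexError("no such node")
    new_nodes = node_data[:victim - 1] + node_data[victim:]

    def renumber(i):
        # Assumed rule: indices above the deleted node shift down by one.
        return i - 1 if i > victim else i

    new_links = [
        (renumber(u), renumber(v), data)
        for u, v, data in links
        if victim not in (u, v)  # links touching the deleted node disappear
    ]
    return new_nodes, new_links


# Hypothetical four-node graph; node 2 is deleted.
nodes = ["A", "B", "C", "D"]
links = [(1, 2, 0.5), (2, 3, 1.0), (3, 4, 2.0), (1, 4, 0.25)]
new_nodes, new_links = delete_node(nodes, links, 2)
print(new_nodes, new_links)
```

Under this rule, any node data entered before the deletion should be looked up by the
new index afterwards, for example old node 4 becomes node 3.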

ALGORITHM SELECTION PANEL: This panel allows the user to select the mapping
algorithm (with or without constraints) and finally execute it by clicking OK.

The C-shell window allows access to UNIX commands without having to leave the
environment. To invoke other suntool applications while in MAPPER, run the other
application as a background process. For example, to invoke the text editor, type
"texteditor &" in the C-shell window. The editor will be overlaid on the MAPPER window.

A screendump (hardcopy of the screen) can be made by entering the following command
in the C-shell (note: make sure that the screen does not change during this process):


screendump | rasfilter8to1 | lpr -v


FIGURE B-2: TASK NODE ENTRY WINDOW


FIGURE B-3: PROCESSOR DATA ENTRY WINDOW